2.2 Vision-based hand gesture representations


PROYECTO FIN DE CARRERA

Título: Development of a Hand-Gesture Recognition System for Human-Computer Interaction

Autor: Ana Isabel Maqueda Nieto

Tutor: Carlos Roberto del Blanco Adán

Departamento: Señales, Sistemas y Radiocomunicaciones

MIEMBROS DEL TRIBUNAL

Presidente: Narciso García Santos

Vocal: Fernando Jaureguizar Núñez

Secretario: Carlos Roberto del Blanco Adán

Suplente: Carlos Cuevas Rodríguez

FECHA DE LECTURA: 23 de septiembre de 2014

CALIFICACIÓN: 10


UNIVERSIDAD POLITÉCNICA DE MADRID

ESCUELA TÉCNICA SUPERIOR

DE INGENIEROS DE TELECOMUNICACIÓN

PROYECTO FIN DE CARRERA

DEVELOPMENT OF A HAND-GESTURE RECOGNITION SYSTEM FOR HUMAN-COMPUTER

INTERACTION

ANA ISABEL MAQUEDA NIETO

SEPTIEMBRE 2014


Development of a hand-gesture recognition system for human-computer interaction

Autor: Ana Isabel Maqueda Nieto
Tutor: Carlos Roberto del Blanco Adán

Grupo de Tratamiento de Imágenes
Dpto. Señales, Sistemas y Radiocomunicaciones

E.T.S. Ingenieros de Telecomunicación
Universidad Politécnica de Madrid

Septiembre de 2014


Resumen

El objetivo de este Proyecto Fin de Carrera es el estudio, diseño y desarrollo de una interfaz para la interacción hombre-máquina, robusta y fiable, basada en el reconocimiento visual de gestos de la mano. Las funcionalidades implementadas están orientadas a la simulación de un dispositivo clásico de interacción hardware: el ratón, mediante el reconocimiento de un vocabulario predefinido de gestos de mano en secuencias de vídeo de color.

Para ello se ha diseñado e implementado un prototipo de un sistema de reconocimiento de gestos de manos formado por tres etapas: detección, seguimiento y reconocimiento. Este sistema está basado en métodos de aprendizaje máquina y reconocimiento de patrones, que han sido integrados junto con otras técnicas de procesado de imágenes para conseguir una elevada precisión en la tasa de reconocimiento sin alcanzar un coste computacional muy elevado.

Respecto a las técnicas de reconocimiento de patrones, se han diseñado e implementado diversos algoritmos y estrategias aplicables a imágenes y secuencias de vídeo en color. El diseño de estos algoritmos tiene como objetivo extraer características espaciales y espacio-temporales de gestos estáticos y dinámicos, que los identifiquen de una forma precisa, y que sean robustos frente a perturbaciones en las imágenes.

Finalmente, se ha creado una base de datos que contiene el vocabulario de gestos necesario para interaccionar con la máquina.

Palabras clave

Reconocimiento, gestos, manos, descriptor de imagen, descriptor de vídeo, patrones, segmentación, espacio-temporal, LBP, SVM, clasificación.


Abstract

The aim of this Master Thesis is the analysis, design, and development of a robust and reliable Human-Computer Interaction interface based on visual hand-gesture recognition. The implementation of the required functions is oriented to the simulation of a classical hardware interaction device, the mouse, by recognizing a specific hand-gesture vocabulary in color video sequences.

For this purpose, a prototype of a hand-gesture recognition system has been designed and implemented, which is composed of three stages: detection, tracking, and recognition. This system is based on machine learning methods and pattern recognition techniques, which have been integrated together with other image processing approaches to achieve a high recognition accuracy and a low computational cost.

Regarding pattern recognition techniques, several algorithms and strategies have been designed and implemented, which are applicable to color images and video sequences. The design of these algorithms has the purpose of extracting spatial and spatio-temporal features from static and dynamic hand gestures, in order to identify them in a robust and reliable way.

Finally, a visual database containing the necessary vocabulary of gestures for interacting with the computer has been created.

Key words

Recognition, gestures, hands, image descriptor, video descriptor, patterns, segmentation, spatio-temporal, LBP, SVM, classification.


Agradecimientos

A mis padres, mi hermano y mis abuelos. Soy consciente del gran esfuerzo que habéis hecho para que llegara hasta aquí y de todo lo que me habéis dado. Gracias por vuestro apoyo y por hacer de mí lo que soy.

A Fede. Gracias por tu cariño, apoyo, paciencia y comprensión, por hacerme sonreír y compartir cada día conmigo.

A los mejores compañeros de viaje que haya podido tener. Gracias Vir Pascual, Vir Martín, Mika, Sandra, Ali, Miguel, Alex, Dani y Luis, por haber vivido conmigo tantos momentos y haberme regalado vuestra amistad incondicional.

A todas aquellas personas que me han brindado su ayuda y amistad, y han hecho que el día a día haya sido más fácil y llevadero: amigos de Quintanar, compañeros de Delegación y compañeros del GTI, especialmente mi tutor Carlos.


Contents

Resumen
Abstract
Agradecimientos
List of Figures
List of Tables
Glossary

1 Introduction
1.1 Motivation and goals
1.2 Structure

2 State of the art
2.1 Introduction
2.1.1 Gesture definition and classification
2.1.2 Enabling technologies for gesture recognition
2.2 Vision-based hand gesture representations
2.2.1 3D model-based representation
2.2.2 Appearance-based representation
2.3 Hand gesture recognition techniques
2.3.1 Detection
2.3.2 Tracking
2.3.3 Recognition
2.4 Feature extraction techniques
2.4.1 Image descriptors
2.4.2 Video descriptors

3 Description of the hand-gesture recognition system
3.1 System overview
3.2 Hand detection
3.3 Hand tracking
3.4 Gesture recognition
3.5 System training
3.6 Virtual mouse application

4 Image and video descriptors
4.1 Introduction
4.2 Image descriptors
4.2.1 Spatiograms of Local Binary Patterns
4.2.2 Local Binary Patterns based on Median and Pseudo-Covariance
4.3 Video descriptor

5 Experimental results
5.1 Database
5.1.1 Environment and scene description
5.1.2 Gesture vocabulary
5.1.3 Requirements
5.2 Metrics
5.2.1 Binary classification
5.2.2 Multi-class problem
5.3 Evaluation of image descriptors for the detection phase
5.4 Evaluation of the recognition phase
5.4.1 Evaluation of video descriptors
5.4.2 Evaluation of the temporal window
5.4.3 Integration of the video descriptor and the temporal window
5.5 Overall system results

A Contributions


List of Figures

2.1 Psychological gesture taxonomy.
2.2 Contact-based devices.
2.3 Vision-based devices.
2.4 Different vision-based hand models.
2.5 Vision-based hand gesture recognition techniques.
2.6 First step to compute the SIFT descriptor (extracted from [1]).
2.7 Stages of keypoint selection (extracted from [1]).
2.8 SIFT feature description. Left: image gradients. Right: keypoint descriptor (extracted from [1]).
2.9 The processing chain of the HOG feature descriptor.
2.10 Local binary pattern from a pixel neighborhood.
2.11 Circularly symmetric neighbor sets for different P and R (extracted from [2]).
2.12 The 58 different uniform patterns with a LBP8,R configuration (extracted from [3]).
2.13 The 36 different rotation invariant patterns with a LBP8,R configuration (extracted from [2]).
2.14 Framework of Complete Local Binary Patterns (extracted from [4]).
2.15 Procedure to compute VLBP1,4,1.
2.16 Procedure to compute the LBP-TOP descriptor with parameters PXY, PXT, PYT, RX, RY, RT.
2.17 Volumetric neighborhood of a pixel considering a LBP-TOP16,8,8,3,3,1 configuration (extracted from [5]).
3.1 Block diagram of the proposed hand-gesture recognition system.
3.2 Block diagram of the hand detection phase.
3.3 Stages in the detection phase.
3.4 Block diagram of the hand-gesture recognition phase.
3.5 Sliding window approach to segment dynamic hand gestures.
3.6 Feature extraction and gesture recognition for a given temporal window.
3.7 Sliding window approach to validate a prediction.
3.8 Execution of the mouse functions based on the gesture predictions.
4.1 Spatiograms of Local Binary Patterns.
4.2 Local Binary Patterns based on Median and Pseudo-Covariance.
4.3 LBP approach considering spatial information.
4.4 Volumetric Descriptors based on Temporal Sub-sampling (VD-TS).
4.5 Image sequence sub-sampling and its parameters.
5.1 Environment and scene captured by the sensor.
5.2 Calculation of the sliding window dimensions.
5.3 Setting the dimensions of the sliding window. The right option is preferable.
5.4 (a) Class distributions and decision threshold. (b) ROC curve.
5.5 Fraction of misclassified samples for different parameter configurations of the LBP descriptor.
5.6 Fraction of misclassified samples for different parameter configurations of the LBP-MPC descriptor.
5.7 Fraction of misclassified samples for different parameter configurations of the VLBP descriptor.
5.8 Confusion matrix and ROC curve for the best parameter configuration of the VLBP descriptor.
5.9 Fraction of misclassified samples for different parameter configurations of the LBP-TOP descriptor.
5.10 Confusion matrix and ROC curve for the best parameter configuration of the LBP-TOP descriptor.
5.11 Confusion matrix and ROC curve for the best parameter configuration of the VD-TS descriptor.
5.12 Confusion matrix and ROC curves for the video sequence that has obtained the best recognition rate.


List of Tables

2.1 Comparison between contact-devices and vision-devices.
4.1 Attributes of the S-LBP descriptor.
5.1 Proposed hand gestures. Description and visual examples.
5.2 Database overview.
5.3 The confusion matrix for a two-class classification problem.
5.4 The confusion matrix for an M-class classification problem.
5.5 Precision, Recall, and F-score metrics for different parameter configurations of the LBP descriptor (all the parameters have been already introduced in previous sections).
5.6 Precision, Recall, and F-score metrics for different parameter configurations of the LBP-MPC descriptor (all the parameters have been already introduced in previous sections).
5.7 Precision, Recall, and F-score metrics for different parameter configurations of the S-LBP descriptor (all the parameters have been already introduced in previous sections).
5.8 Average accuracy for different parameter configurations of the VLBP descriptor (all the parameters have been already introduced in previous sections).
5.9 Average accuracy for different parameter configurations of the LBP-TOP descriptor (all the parameters have been already introduced in previous sections).
5.10 Average accuracy for different parameter configurations of the VD-TS descriptor (all the parameters have been already introduced in previous sections).
5.11 Optimal size of the sliding temporal window for different video sequences.
5.12 Recognition results for the different considered video descriptors.
5.13 Parameters for the recognition phase.
5.14 Global recognition accuracy.


Glossary

HCI Human-Computer Interaction

SVM Support Vector Machine

LBP Local Binary Pattern

VLBP Volume Local Binary Pattern

LBP-TOP Local Binary Pattern of Three Orthogonal Planes

HOG Histogram of Oriented Gradients

SIFT Scale Invariant Feature Transform

S-LBP Spatiograms of Local Binary Patterns

LBP-MPC Local Binary Patterns based on Median and Pseudo-Covariance

VD-TS Video Descriptors based on Temporal Sub-sampling


Chapter 1

Introduction

1.1 Motivation and goals

In the last decades, there has been great interest in Human-Computer Interaction (HCI) systems, in order to provide better input interfaces that make interactions with computers as natural as the interaction among humans. In addition, typical devices, such as the keyboard and the mouse, cannot completely satisfy people's interaction requirements. For these reasons, research studies have focused on developing interfaces based on natural human interactions, such as speech, touch, and gestures, which are the common mechanisms that people use to communicate among themselves.

While touch-based interfaces are now very common due to their popularity in smartphones and tablets, vision-based HCI is becoming a popular research topic. Face recognition, body-pose recognition, and hand-gesture recognition are vision-based HCI examples. In particular, interfaces based on hand-gesture recognition represent an attractive and natural alternative to traditional HCI devices, since they are less intrusive and more convenient for interacting with 3D spaces. In addition, hand gestures play an important role in human communication, because we constantly use our hands to interact with objects and gesticulate while we speak. Thus, they can be seen as the most intuitive way of establishing communication with a computer. Hand-gesture recognition has many practical HCI applications in real life, such as multimedia application control, video games, virtual navigation, and medical rehabilitation, but it is also used for visual surveillance and the analysis of sport events.

Up to now, many works on hand-gesture recognition have been developed, but there are still some challenges affecting their performance. Recognizing a hand, and characterizing its shape and motion in images or videos, is a complex task. The hand dynamics are very complex, since the hand is a deformable object with more than 25 degrees of freedom, involving finger, wrist, and elbow joints. Therefore, it is very difficult to model its different poses and motion. In addition, the appearance of a hand can change dramatically because of illumination changes, scaling, blurring, orientation, and occlusions. For example, similar hand gestures performed by people with different skin colors, or in front of complex backgrounds, can result in large appearance variations. On the other hand, the intraclass and interclass variance of the gestures is very high. The same action performed by the same individual several times may look quite dissimilar in many ways, and this problem gets worse if the same action is performed by two different individuals. Finally, since gestures typically appear within a continuous stream of motion, a temporal segmentation for determining when they start and end is necessary.

One of the most popular approaches for hand-gesture recognition is based on machine learning algorithms. They are used together with feature extraction techniques, more commonly known as descriptors, in order to perform the recognition task. On the one hand, descriptors must be able to represent the image region in a reliable way, independently of the scene conditions. For this purpose, descriptors have to be invariant to translations, rotations, scale changes, and dramatic illumination changes. On the other hand, it is desirable that they do not have large dimensions, in order to achieve a high computational efficiency. Therefore, it is necessary to find a good trade-off between recognition accuracy and computational efficiency.

The goal of this project is the development of a more natural, intuitive, user-friendly, and less intrusive human-computer interface for controlling an application by executing hand gestures. In particular, we have designed and implemented a prototype of a hand-gesture recognition system that imitates a mouse-like pointing device, where different mouse functions are triggered depending on the recognized hand gesture. The system is divided into three stages: detection, tracking, and recognition. The detection stage processes a video sequence frame by frame, and uses a machine learning algorithm together with an image descriptor in order to detect potential hand regions. These detections are employed as the input of a tracker to generate a spatio-temporal trajectory of hand regions. Finally, the recognition stage segments the trajectory in the time dimension and computes a video descriptor, which is delivered to a set of classifiers to perform the gesture recognition.
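Purely as an illustration of how these three stages fit together (the callables passed below are hypothetical placeholders, not the actual modules of this prototype), the pipeline can be sketched as:

```python
def run_pipeline(frames, detect, track, describe, classifiers, window=20):
    """Sketch of the three-stage pipeline: detection -> tracking -> recognition.

    detect(frame)            -> candidate hand regions (image descriptor + classifier)
    track(frame, candidates) -> the tracked hand region for this frame
    describe(clip)           -> video descriptor of the last `window` regions
    classifiers              -> dict {gesture_name: scoring function}
    """
    trajectory = []                                   # spatio-temporal hand trajectory
    for frame in frames:
        candidates = detect(frame)
        trajectory.append(track(frame, candidates))
        if len(trajectory) >= window:                 # temporal segmentation
            feat = describe(trajectory[-window:])
            scores = {g: f(feat) for g, f in classifiers.items()}
            yield max(scores, key=scores.get)         # recognized gesture for this window
```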

One of the main contributions of this project is the design of novel image and video descriptors for the detection and recognition stages, respectively. In addition to being invariant to dramatic illumination changes, scale changes, and translations, they are more discriminative and computationally efficient than other state-of-the-art methods. On the other hand, in order to test the developed improvements, a database has been created. It contains a specific visual vocabulary based on hand gestures, which represent different functionalities for controlling the mouse device. This vocabulary, together with the features extracted by the descriptors, is used to train the machine learning algorithms employed in the detection and recognition stages.


1.2 Structure

After this introduction, in Chapter 2 we will make a study of the state of the art. First, we will present some gesture taxonomies and the main available technologies for capturing hand gestures, along with their advantages and disadvantages. Secondly, we will introduce the different representations used to model a hand, and then we will make a brief review of the most common and popular techniques used in hand-gesture recognition systems. Finally, we will make a study of several image and video descriptors that are used for hand-gesture recognition.

In Chapter 3, we will describe the proposed hand-gesture recognition system and the virtual mouse application. We will start by introducing a global vision of the framework and its characteristics. Then we will explain in detail the different image processing strategies and recognition techniques used in each of the stages, that is, detection, tracking, and recognition.

Next, in Chapter 4, we will present our proposed image and video descriptors. First, we will make a brief introduction to the hand-gesture recognition challenge, and then we will explain in detail the design of our descriptors.

In Chapter 5, we will explain the experimental results. We will start by describing the characteristics and requirements that the created database must fulfill. This section will contain the description of the captured scene and the description of the proposed gesture vocabulary. Next, we will introduce the metrics employed in order to quantify the results of our approaches. Finally, the results obtained with the different image and video descriptors for the detection and recognition stages, respectively, will be presented.

In order to conclude, in Chapter ??, we will present the final conclusions and the future work.

Finally, in Appendix A, we will present the additional contributions of this work, such as publications, databases, and software.


Chapter 2

State of the art

2.1 Introduction

The aim of research in hand-gesture recognition is to design and develop algorithms, techniques, and systems that can identify human gestures and process them to control devices, be part of human-computer interfaces, or monitor human activities.

In general, the problem of gesture recognition can be divided into two sub-problems: hand-gesture modeling, which is explained in Section 2.2, and the inference problem, which encompasses all the techniques explained in Sections 2.3 and 2.4. Several techniques have been proposed for hand-gesture recognition using different hand models and acquisition devices. However, the emergence of new technologies and improved devices for capturing gestures has fostered new gesture representations, recognition tools, and techniques to combine them. On the other hand, gesture recognition is still a challenging problem because of the complex dynamics of the hand, the great variability of its appearance, and the process of temporal segmentation of dynamic gestures.

In this chapter, we present a review of the state of the art in gesture recognition considering each one of the previous problems. In addition, this review includes the available data acquisition technologies, the different models for hand-gesture representation, and the most popular recognition techniques. On the other hand, since the design of descriptors is one of the key issues of this project, we also present the most used and well-known image and video descriptors for feature extraction.

2.1.1 Gesture definition and classification

A gesture can be defined as a non-verbal form of communication, articulated by some part of the body, that can be used instead of, or in combination with, verbal communication. Not only hand and body poses or movements are considered gestures, but also face and eye movements, such as winking, smiling, nodding, or rolling the eyes. We use them every day, and they play an important role in human communication. However, the meaning of a specific gesture can differ from one culture to another. For example, pointing with a finger is very common in the USA and Europe, but it is offensive and rude in Asia. Therefore, the gesture interpretation depends on the culture and also on the individual mood.

At this point, we can classify gestures taking into account the body parts involved: hand gestures, face gestures, and body gestures. Focusing on hand gestures, the literature classifies them into two main categories: static and dynamic gestures. On the one hand, a static hand gesture represents a specific hand pose that keeps the same orientation and position during a period of time. Keeping our fist closed, a thumb up, or forming the "ok" symbol are some static hand gesture examples. On the other hand, if a movement is performed during the considered period of time, the hand gesture is dynamic. This movement may be any shifting of a hand pose, a rotation, changes in the finger positions, or a combination of them, such as opening and closing our hand, or moving our forefinger from top to bottom.

In addition to the physical aspects of gestures, research studies have also proposed other psychological taxonomies, since they are very important in order to achieve a good gesture representation for recognition systems [6]. They have been summarized in the diagram of Figure 2.1. First of all, a gesture can be conscious or unconscious, depending on whether it is intentional or not, and from there we can distinguish five types of gestures:

• Emblems: they are quotable gestures that can be directly translated into short verbal messages, and they are very culture specific. For example, waving a hand to say goodbye, or nodding for assurance.

• Illustrators: they are gestures that we make while speaking. We use them to emphasize a key point in the speech; thus they are inherent to our thoughts and speech. Illustrating a throwing action when we pronounce the words "he threw", for example, would be an illustrator gesture. This type of gesture can be classified into five categories:

– Beats: short and quick gestures that are usually repetitive and rhythmic.

– Deictic gestures: they are pointing gestures that can point to a concrete location or person, or to an abstract location or period of time.

– Iconic gestures: hand movements that perform a figural representation of an action, like moving the hand upward with wiggling fingers to depict tree climbing.

– Metaphoric gestures: they represent abstractions.

– Cohesive gestures: they are thematically related gestures that are temporally separated due to an interruption by another communicator.

• Affect displays: they are gestures that express emotions or intentions. This type of gesture is less culture specific.

• Regulators: they control the interaction in a conversation.

• Adaptors: they are not used intentionally, because they are habits of the communicator. They enable the release of body tension, like moving our leg quickly.

Figure 2.1: Psychological gesture taxonomy.

2.1.2 Enabling technologies for gesture recognition

Implementations of accurate and efficient hand-gesture recognition systems have been possible because of the availability of increasingly modern technology. According to the type of sensor, enabling technologies can be divided into two main categories, namely contact-based devices and vision-based devices, which are defined and compared below.

Contact-based devices

Contact-based devices for gesture recognition are based on a physical interaction between the user and the interface device in order to increase recognition accuracy. They rely on electromechanical devices that capture the gesture data via a set of motion sensors wired to a computer. These data are then processed via signal processing and pattern matching to categorize and recognize the different gestures.


Some of the main technologies are instrumented gloves, multi-touch screens, accelerometers, and tracking devices like joysticks. These devices can include several detectors, as in a multi-touch screen, or only one detector, as in the three-axis accelerometer for motion sensing and tilting of the Nintendo Wii Remote. This type of device can be further categorized into five classes:

• Mechanical: they are worn by the user in order to capture their movements, such as the IGS-190 body suit or the CyberGlove II wireless instrumented glove, which are shown in Figure 2.2. The latter measures hand-joint angles, together with magnetic trackers that determine 3-D hand positions, whereas the former uses small solid-state inertial sensors to accurately measure the exact rotations of the user's bones for capturing the motion. CyberGloves and magnetic trackers have been used in [7] for trajectory modeling in hand-gesture recognition.

• Inertial: these devices detect motion from inertial measurements (accelerations and rotations), such as the accelerometers in the Wii Remote and the gyroscopes in the IGS-190. For example, the recognition of gestures with a Wii controller using Hidden Markov Models was proposed in [8], independently of the target system.

• Haptics: this is a sensory technology based on measuring the forces, vibrations, or motions exerted by the user, in order to recreate the sense of touch. For example, multi-touch screens have become very popular due to their use in tablets and smartphones.

• Magnetic: these devices measure the variations of an artificial magnetic fieldin order to detect motion. As a disadvantage, they can produce some healthissues.

• Ultrasonic: they are motion trackers composed of sonic emitters (for sending out ultrasounds), sonic discs (for reflecting ultrasounds), and multiple sensors for timing the return pulse, from which the position and the orientation are computed. In spite of having a low resolution, they are very useful in environments with little light, magnetic obstacles, or noise.

Vision-based devices

Vision-based devices for gesture recognition use one or several cameras to capture a video sequence. The captured video is processed via image processing and artificial intelligence techniques to recognize and interpret the gestures. The visual appearance of a hand can vary greatly because of its deformable nature (more than 25 degrees of freedom), the camera viewpoint, different scales, illumination conditions, and the variability of the gesture execution (speed and structure). One of the advantages of these devices is the ability to capture a wide range of gestures. We can distinguish several types of sensors:


(a) CyberGlove II. (b) IGS-190 body suit.

Figure 2.2: Contact-based devices.

• Infrared cameras: they use infrared radiation in order to form an image, providing a silhouette of the body, hand, or object.

• Color cameras: they are the most common because they are the cheapest. These cameras can have some interesting properties, such as fish-eye lenses for wide-angle vision.

• PTZ cameras: pan-tilt-zoom cameras are able to rotate in a horizontal plane (panning) and in a vertical plane (tilting), and to zoom manually or automatically. They can focus on a particular object and track it within the camera field of view (see Figure 2.3 (a)).

• Depth cameras: they can capture depth information. In particular, some devices integrate more than one sensor type, like the Kinect 2 and the Senz3D, which capture both color and depth information (see Figure 2.3 (b)).

• Stereo cameras: they have two or more lenses in order to simulate human vision and capture 3D information.

• Body markers: they are used to improve the recognition accuracy. They can be passive, which only reflect light, or active, such as LEDs. In these systems, each camera delivers 2D frames with the marker positions from its view. Finally, a preprocessing step is usually necessary in order to interpret the views and positions in a 3D space.

(a) PTZ camera. (b) Senz3D sensor.

Figure 2.3: Vision-based devices.

Advantages and disadvantages of both technologies

The aforementioned technologies have advantages and disadvantages, which have to be taken into account to choose the most appropriate one for our system. On the one hand, user cooperation is needed for contact-based devices. Users have to wear some kind of clothing or device while performing gestures, which can be uncomfortable for a long period of time and can restrict the user's movements. In return, however, they provide more precise information and less complexity in the implementation. In particular, these techniques provide good results in simulated environments, but their use may not be feasible in real-case scenarios due to their invasiveness and the uncontrolled context. In addition, the contact nature of these devices can cause allergies due to physical contact with some materials, and a risk of cancer due to magnetic radiation. On the other hand, vision-based devices need a more difficult configuration and suffer from occlusion problems more than contact-based ones. However, they are more user-friendly, and therefore more likely to be used in the long run. For these reasons, vision-based systems have become more common in recent years. This discussion is summarized in Table 2.1, which shows a comparison between the most relevant points of each type of technology.

Criterion Contact-devices Vision-devices

User cooperation Yes No

User intrusive Yes No

Precise Yes/No No/Yes

Flexible to configure Yes No

Flexible to use No Yes

Occlusion problem No(Yes) Yes

Health issues Yes(No) No

Table 2.1: Comparison between contact-devices and vision-devices.


2.2 Vision-based hand gesture representations

Taking into account the aim of this project, we will focus on hand gestures captured by vision-based devices. Vision-based approaches use one or more video cameras to capture a gesture action, which is interpreted by computer vision techniques. In this case, it is necessary to model the gesture in order to abstract and represent the different hand poses and their motion. The model to be used depends on the type of application. If an application with a small number of gestures is needed, then a simple model can be used. However, an application with a large gesture set may require a more detailed model.

For this purpose, several gesture representations have been proposed and implemented. We can classify them into two main categories, namely 3D models and appearance-based models, which are explained below and shown in Figure 2.4.

2.2.1 3D model-based representation

3D hand model-based representations rely on a 3D kinematic hand model with a considerable number of degrees of freedom, and try to estimate the hand parameters by comparing the input image with a 2D rendered image resulting from the projection of the 3D hand model. Firstly, the algorithm proposes some hypothesis parameters based on prior knowledge, such as the previously recovered hand configuration and the hand dynamics. Then, the algorithm uses the hypothesis parameters to animate the 3D hand model and project it onto a 2D image, which is compared with the input image. The algorithm keeps varying the hypothesis parameters until the projected image matches the input image.
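As a rough, generic sketch of this hypothesize-render-compare loop (not the method implemented in this project), the following assumes a user-supplied render(params) function that projects the 3D hand model to a 2D image, and fits the parameters by simple random local search:

```python
import numpy as np

def fit_hand_model(observed, render, init_params, iters=500, step=0.05, seed=0):
    """Toy analysis-by-synthesis fitting loop.

    observed: 2D image (e.g. a binary hand silhouette), as a numpy array.
    render:   hypothetical callable mapping a parameter vector (joint angles,
              global pose) to a synthetic 2D image of the same shape.
    """
    rng = np.random.default_rng(seed)
    params = np.asarray(init_params, dtype=float)
    best_err = np.sum((render(params) - observed) ** 2)   # image discrepancy
    for _ in range(iters):
        candidate = params + rng.normal(0.0, step, size=params.shape)
        err = np.sum((render(candidate) - observed) ** 2)
        if err < best_err:                                 # keep better hypotheses
            params, best_err = candidate, err
    return params, best_err
```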

These models can be classified into three types, which are explained below in decreasing order of complexity:

• Volumetric models: they describe the 3D visual appearance of the hand, hence they contain fine details like skeleton and skin surface information. While these models are very realistic, they require many parameters.

• 3D geometric models: they use structures like cylinders, spheres, ellipsoids, and hyper-rectangles to approximate the shape of the hand parts. They are less precise than the volumetric models, but they still contain the essential skeleton information, and their parameters are much simpler.

• 3D skeleton models: they use a reduced set of equivalent joint and angle parameters, together with segment lengths, in order to provide information about the articulations and the 3D degrees of freedom. They are the most common models due to their simplicity and high adaptability.


3D hand models offer a detailed description that covers a wide class of hand gestures. However, they require a very large image database to cover all the different views, since a hand is a complex, articulated, deformable object with many degrees of freedom. Therefore, matching the input video images against all the hand images in the database is time-consuming and computationally expensive. Other limitations of 3D hand models are the lack of capability to deal with singularities that arise from ambiguous views, and the scalability problem, because a 3D model with specific kinematic parameters cannot deal with the wide variety of hand sizes of different people.

For these reasons, most of the current 3D hand model-based approaches that focus on real-time tracking use global hand models with local finger motions, under restricted lighting and background conditions, to ease the hand recognition task.

2.2.2 Appearance-based representation

Appearance-based representations use 2D image features to model the visual appearance of the hand, which are compared with the image features extracted from the input image. The most common and popular 2D models are based on the following image features:

• Color: these models characterize the hand taking into account its pixel values, as in [9]. In order to increase accuracy, these models sometimes use hand markers or colored gloves.

• Silhouette geometry: they use the attributes of a binary silhouette, such as perimeter, convexity, surface, bounding box, elongation, rectangularity, centroid, and orientation, as model parameters (see the sketch after this list).

• Deformable contour: they are based on deformable active contours or "snakes", with some specific parameters, that delineate the hand outline by means of energy minimization. Snakes were used in [10] for the analysis of gestures and actions in technical talks for video indexing.

• Motion: they model the hand using patterns of motion known as optical flow. Optical flow reflects the image changes due to motion during a time interval, and is represented by velocity vectors attached to the moving pixels in the image.
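To make the silhouette-geometry attributes concrete, the following numpy-only sketch (an illustration, not one of the descriptors proposed in this thesis) derives a few of them from a binary hand mask:

```python
import numpy as np

def silhouette_features(mask):
    """Simple geometric attributes of a binary hand silhouette (array of 0/1)."""
    ys, xs = np.nonzero(mask)
    area = float(len(xs))                              # surface (pixel count)
    x0, x1, y0, y1 = xs.min(), xs.max(), ys.min(), ys.max()
    width, height = x1 - x0 + 1, y1 - y0 + 1           # bounding box
    elongation = max(width, height) / min(width, height)
    rectangularity = area / float(width * height)      # fill ratio of the box
    centroid = (xs.mean(), ys.mean())
    return {"area": area, "bbox": (x0, y0, width, height),
            "elongation": elongation, "rectangularity": rectangularity,
            "centroid": centroid}

# Example on a tiny 5x5 mask
mask = np.zeros((5, 5), dtype=np.uint8)
mask[1:4, 2:4] = 1
print(silhouette_features(mask))
```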

For appearance-based models, it is generally easier to achieve real-time performance. Their main disadvantage is a limited ability to handle a wide range of hand-gesture classes, because of the simpler image features used.


Figure 2.4: Different vision-based hand models.


2.3 Hand gesture recognition techniques

Most vision-based systems for recognizing hand gestures include three fundamental phases: detection, tracking, and recognition. After choosing a hand model, different hand detection methods, together with hand tracking procedures, are used to localize the hand regions in the input image sequence. Once the spatio-temporal sequence of hand regions is obtained, several features are computed in order to determine the hand model parameters. Finally, some recognition technique is applied to the sequence of estimated hand model parameters to identify the performed dynamic gesture.

In the following discussion, we present the most relevant characteristics and techniques used in each phase.

2.3.1 Detection

The detection of hands is the first step in hand-gesture recognition systems, needed to segment the regions of interest and isolate them from the image background. This phase reduces the processing time and increases the precision of the recognized gestures. Hand detection techniques are usually based on extracting visual features that can be attributed to the presence of hands in the field of view of the camera, such as skin color, shape, motion, anatomical hand models, or a combination of them. In [11], several hand detection approaches were discussed, and in [12] a comparative review of the performance of some hand segmentation methods was presented.

Next, we present a classification of the hand detection techniques according to the visual features used.

Skin color

Approaches based on skin color segmentation are among the most used in the literature. Not only does the skin color model have to be robust against human skin variability and changing illumination conditions, but an appropriate color space also has to be chosen (RGB, HSV, YCrCb, YUV). In particular, a color space that efficiently separates the chromaticity and luminance components is preferable, in order to remove the effect of shadows, illumination changes, and modulations of the skin orientation due to the light source [13]. Most skin color segmentation techniques employ piecewise linear classifiers [14], look-up tables, Bayesian classifiers with the histogram technique [15], Gaussian classifiers [16], or multilayer perceptrons [17]. However, hands can also be confused with background objects that have a color distribution similar to human skin. In this case, detection techniques can be complemented by a background subtraction phase, as proposed in [18].
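As a simple illustration of the idea (a fixed-threshold sketch rather than one of the trained skin classifiers cited above), the chrominance channels of the YCrCb space can be thresholded with OpenCV; the Cr/Cb bounds below are common heuristic values, not values taken from this work:

```python
import cv2
import numpy as np

def skin_mask(bgr_frame):
    """Rough skin segmentation by thresholding the Cr and Cb chrominance channels."""
    ycrcb = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2YCrCb)
    lower = np.array([0, 135, 85], dtype=np.uint8)     # (Y, Cr, Cb) lower bounds
    upper = np.array([255, 180, 135], dtype=np.uint8)  # heuristic skin range
    mask = cv2.inRange(ycrcb, lower, upper)
    # Remove small speckles with a morphological opening
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
```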


Shape

The shape of the hand has been utilized for hand detection by extracting its contour, which is generally based on edge detection [19]. However, a large number of edges belonging to the background are also detected, and the contour can be hindered by occlusions or degenerate viewpoints. In this case, edge detection is combined with skin color [20] and background subtraction techniques. On the other hand, there are methods that consider the hand as a deformable shape and use matching algorithms to find hands [21]. Other works detect the boundary of the hand using a deformable template model [22].

Feature descriptors

In this case, the detection of hands is based on their appearance and texture in gray-level images, by extracting local invariant features and using machine learning approaches. In order to extract reliable features, keypoint detectors and descriptors that can be invariant to position, scale, rotation, and illumination changes are used [23] [24]. Popular image feature descriptors are Histograms of Oriented Gradients (HOG) [25], the Scale Invariant Feature Transform (SIFT) [26], and Local Binary Patterns (LBP) [2]. In Section 2.4, we will present a detailed review of feature descriptors for images and video. Regarding learning methods, several techniques such as the AdaBoost algorithm [27], Support Vector Machines (SVM) [20], and Artificial Neural Networks (ANN) have shown good results in hand detection [28].
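For illustration only, a descriptor-plus-classifier hand detector of this kind could be prototyped with scikit-image and scikit-learn, assuming those libraries are available; this generic sketch is not the detector developed in this project:

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(patches):
    """HOG descriptors for a list of fixed-size grayscale patches (e.g. 64x64)."""
    return np.array([hog(p, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)) for p in patches])

def train_hand_detector(hand_patches, background_patches):
    """Train a linear SVM on HOG features of hand vs. background patches."""
    X = hog_features(hand_patches + background_patches)
    y = np.array([1] * len(hand_patches) + [0] * len(background_patches))
    clf = LinearSVC(C=1.0)
    clf.fit(X, y)
    return clf   # clf.decision_function(hog_features([patch])) scores new windows
```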

3D model

Hand detection based on 3D hand models is view-independent. The 3D hand models utilized should have enough degrees of freedom to adapt to the pose of the hand present in an image. Different image features are employed to build feature-model correspondences. For instance, point and line features are used in kinematic hand models to recover the angles formed at the joints of the hand [20]. In other cases, a deformable model is fitted to the image data by attracting the model to the image edges [29].

Motion

Motion is a dynamic feature that can also be employed for hand detection [30]. Motion-based methods assume that the only movement in the image is due to the hand motion, because they need a highly controlled setup. In more recent applications, motion information has been integrated with additional visual cues [31].


2.3.2 Tracking

The tracking layer is responsible for localizing the hands in the image sequence by performing temporal data association across successive image frames. If the detection technique is fast enough and provides high detection accuracy, it could be used for tracking as well, just by directly concatenating the detections. However, hands in an image sequence can change their appearance drastically and move very fast, producing false and missing detections. Moreover, in model-based methods, tracking also has to provide a way to maintain estimates of model parameters, variables, and features that are not directly observable at a certain moment in time.

Template-based tracking

Methods based on this technique are very similar to methods for hand detection. They use a template to model the hand, and matching algorithms to find the best candidate in the current frame (the one that is most similar to the template). Some of them invoke the hand detector in order to restrict the image search space, assuming that the hand will appear in the same spatial neighborhood. Once the hand has been detected, its position is used as a reference to detect the hand in the next frame, together with the other hand features. For instance, some approaches are based on skin color [32] and deformable hand contours [33] as template models.
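A minimal sketch of this idea using OpenCV's normalized cross-correlation template matching, restricted to a neighborhood around the previous hand position (the search margin is an arbitrary assumption):

```python
import cv2

def track_by_template(frame_gray, template_gray, prev_xy, margin=40):
    """Search for the hand template near its previous top-left position (x, y)."""
    h, w = template_gray.shape
    x, y = prev_xy
    x0, y0 = max(0, x - margin), max(0, y - margin)
    x1 = min(frame_gray.shape[1], x + w + margin)
    y1 = min(frame_gray.shape[0], y + h + margin)
    roi = frame_gray[y0:y1, x0:x1]
    scores = cv2.matchTemplate(roi, template_gray, cv2.TM_CCOEFF_NORMED)
    _, best, _, loc = cv2.minMaxLoc(scores)          # best score and its location
    return (x0 + loc[0], y0 + loc[1]), best          # new top-left corner, confidence
```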

Optimal Bayesian estimation

From the viewpoint of feature tracking, the Kalman Filter [34] provides an optimal Bayesian estimation framework to turn observations (feature detections) into estimates (the extracted trajectory). It is one of the most widely used methods for tracking and estimation because of its real-time performance [35], its treatment of uncertainty, and the provision of predictions for the successive frames. The Kalman Filter is a linear estimator, but if the system dynamics are non-linear, extensions such as the Extended Kalman Filter [36], the Unscented Kalman Filter [37], and Particle Filtering can be used.
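The predict/update cycle of a constant-velocity Kalman Filter over the hand centroid can be written in a few lines of numpy; this is a generic textbook sketch with arbitrarily chosen noise covariances, not the tracker of this system:

```python
import numpy as np

dt = 1.0                                           # one frame
F = np.array([[1, 0, dt, 0],                       # constant-velocity state transition
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)
H = np.array([[1, 0, 0, 0],                        # we only observe the (x, y) centroid
              [0, 1, 0, 0]], dtype=float)
Q = 0.01 * np.eye(4)                               # process noise (assumed)
R = 4.0 * np.eye(2)                                # measurement noise (assumed)

def kalman_step(x, P, z):
    """One predict/update cycle. x: state (x, y, vx, vy), P: covariance, z: detection."""
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    S = H @ P_pred @ H.T + R                       # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)            # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(4) - K @ H) @ P_pred
    return x_new, P_new
```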

Particle filtering

Particle filter trackers have been used to track the position of hands and finger configurations in dense visual clutter. These techniques maintain a probability distribution over the location of the tracked object, represented by a set of particles [38]. Their main disadvantage is that many particles can be required for complex models, such as a human hand. In this context, an example of visual tracking is the Condensation algorithm [39], which has been utilized to track curves against cluttered backgrounds, exhibiting a better performance than Kalman Filters while also operating in real time [40].
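A compact bootstrap particle filter over the 2D hand position looks roughly as follows; it is a generic sketch with a random-walk motion model and arbitrary noise values, not this project's tracker:

```python
import numpy as np

def particle_filter_step(particles, weights, measurement, rng,
                         motion_std=5.0, meas_std=10.0):
    """One step over 2D particles (N x 2): predict, weight, resample, estimate."""
    # Predict with a random-walk motion model
    particles = particles + rng.normal(0.0, motion_std, particles.shape)
    # Re-weight by the Gaussian likelihood of the observed hand position
    d2 = np.sum((particles - measurement) ** 2, axis=1)
    weights = weights * np.exp(-0.5 * d2 / meas_std ** 2) + 1e-300
    weights /= weights.sum()
    # Resample when the effective sample size collapses
    if 1.0 / np.sum(weights ** 2) < 0.5 * len(particles):
        idx = rng.choice(len(particles), size=len(particles), p=weights)
        particles = particles[idx]
        weights = np.full(len(particles), 1.0 / len(particles))
    estimate = np.average(particles, axis=0, weights=weights)
    return particles, weights, estimate
```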


Mean Shift

The mean-shift algorithm is an iterative method that detects the local maxima of a density function by shifting a kernel towards the average of the data points in its neighborhood. It has been widely utilized for tracking objects in image sequences, generally by matching the color probability distribution of a target with that of the object model [41], because of its simplicity and low computational cost. Some extensions that address changes in scale have been proposed, such as the Continuously Adaptive Mean Shift (CamShift) [42]. It adjusts the tracked window size to objects that change their scale over time in a robust and efficient way [43].
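With OpenCV, this color-histogram tracking scheme can be sketched as below; it shows generic use of calcBackProject and CamShift with an assumed hue-histogram model of the hand:

```python
import cv2

def make_hue_model(bgr_roi):
    """Hue histogram of the initial hand region, used as the color model."""
    hsv = cv2.cvtColor(bgr_roi, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0], None, [180], [0, 180])
    return cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)

def camshift_track(bgr_frame, hue_hist, window):
    """One CamShift update; `window` is the previous (x, y, w, h) search window."""
    hsv = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2HSV)
    backproj = cv2.calcBackProject([hsv], [0], hue_hist, [0, 180], 1)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1.0)
    rotated_box, window = cv2.CamShift(backproj, window, criteria)
    return rotated_box, window
```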

2.3.3 Recognition

The last step is the hand-gesture recognition itself. Once the gesture regions have been extracted in the previous phases, they are analyzed to identify specific gestures. Recognition techniques use the spatio-temporal data, or some spatio-temporal features, in order to match them against training data and/or classify them. Depending on the application and the system requirements, different approaches from statistical modeling, computer vision, pattern recognition, and image processing are used in this phase.

In order to recognize static gestures, a general classifier (or even a template matcher) can be used, such as K-nearest neighbors, Neural Networks, or Support Vector Machines (SVM). However, dynamic hand gestures have a temporal aspect to be considered, and require techniques that handle this additional dimension, such as Hidden Markov Models (HMM). An alternative is to model the temporal dimension within the hand-gesture representation itself, for example with spatio-temporal features (or video descriptors), which will be described in Section 2.4, and then to use standard classification techniques.

The methods employed for static gestures are further divided into linear and non-linear classifiers. The former are suitable for linearly separable data, and the latter for the other cases. According to their outcome, learning algorithms can also be classified into supervised learning, which mainly matches samples to labels, and unsupervised learning, which clusters samples without labels. The choice of the learning algorithm depends mainly on the chosen hand-gesture representation.

Regarding dynamic gestures, automata-based methods are the most common approaches. For instance, Finite State Machines (FSM) and HMMs are examples of automata with a set of states and a set of transitions. The states represent static hand gestures (postures), and the transitions represent the allowed changes between one posture and another, with temporal and/or probabilistic constraints. The main limitation of these approaches is that the gesture model must be modified when a new gesture needs to be recognized. In addition, the computational complexity is generally high, since it is proportional to the number of features to be recognized. On the other hand, they provide better results and a higher recognition accuracy than other methods.

Figure 2.5 shows an overview of the different hand-gesture recognition techniques for each phase. In the following lines, we detail some of the most common techniques used for static and dynamic hand-gesture recognition.

Figure 2.5: Vision-based hand gesture recognition techniques.

K-nearest neighbor

K-nearest neighbors (k-NN) is a method for classifying objects based on the closest training examples in the feature space. It is a type of instance-based learning where the function is only locally approximated, and all computations are postponed until classification. An object is classified by the majority vote of its k nearest neighbors. In order to identify the neighbors, the objects are represented by vectors in a multidimensional feature space, and it is usual to use the Euclidean distance, as in [44]. The k-NN algorithm is sensitive to the local structure of the data.
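A minimal numpy sketch of the majority-vote rule with Euclidean distances (for illustration; in practice a library implementation such as scikit-learn's KNeighborsClassifier would normally be used):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    """Classify feature vector x by majority vote of its k nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)        # Euclidean distances
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]
```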

Support Vector Machine

Support Vector Machines (SVM), along with the use of kernels, are non-linear classifiers that map the input data to some high-dimensional space where the data can be linearly separated, providing great classification performance [45]. Given a set of training samples, each one marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other. An SVM model is a representation of the samples in a feature space, mapped in such a way that the samples of the separate categories are divided by a clear gap that is as wide as possible. New samples are then mapped into that same space, and predicted to belong to one of the categories based on which side of the gap they fall on [46]. In addition to performing linear classification, SVMs can efficiently perform non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces. One of the bottlenecks of the SVM is the large number of support vectors taken from the training set to perform the classification task.
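As a usage illustration (assuming scikit-learn and precomputed descriptor vectors; this is not the training pipeline of this project), a kernel SVM for a multi-class gesture vocabulary can be set up as follows:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# X: one descriptor vector per gesture sample, y: gesture labels (e.g. 0..N-1)
def train_gesture_svm(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = SVC(kernel="rbf", C=10.0, gamma="scale")   # RBF kernel = the "kernel trick"
    clf.fit(X_tr, y_tr)
    print("held-out accuracy:", clf.score(X_te, y_te))
    return clf
```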

Hidden Markov Models

Hidden Markov Models (HMM) were introduced in the mid 1990s, becoming the recognition method of choice because of their implicit solution to the temporal segmentation problem [47]. An HMM is a statistical model in which a set of hidden parameters is determined from a set of related observable parameters. In an HMM, the state is not directly observable, but variables influenced by the state are. Each state has a probability distribution over the possible output values. Therefore, the sequence of values generated by an HMM provides information about the sequence of states. In the context of gesture recognition, the observable parameters are estimated by recognizing postures (values) in images. For this reason, and because gestures can be recognized as a sequence of postures, HMMs have been widely utilized for gesture recognition [48]. On the other hand, the state transitions represent the probability that a certain hand posture transitions into another. In this context, it is typical that each gesture is handled by a different HMM. The recognition problem is then transformed into the problem of selecting the HMM that best matches the observed data, given the probability of a state being observed with respect to the context. This context may be spelling or grammar rules, the previous gestures, cross-modal information, and others.
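The "one HMM per gesture, select the best-scoring model" scheme can be sketched with the hmmlearn package, assumed here to be available; this is generic usage, not the recognizer of this work:

```python
import numpy as np
from hmmlearn import hmm

def train_gesture_hmms(sequences_per_gesture, n_states=5):
    """Fit one Gaussian HMM per gesture from lists of observation sequences."""
    models = {}
    for gesture, seqs in sequences_per_gesture.items():
        X = np.vstack(seqs)                      # stacked observations
        lengths = [len(s) for s in seqs]         # sequence boundaries
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
        m.fit(X, lengths)
        models[gesture] = m
    return models

def recognize(models, observation_seq):
    """Return the gesture whose HMM gives the highest log-likelihood."""
    return max(models, key=lambda g: models[g].score(observation_seq))
```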

Dynamic time warping

The Dynamic Time Warping (DTW) algorithm has long been used to find the optimal alignment of two signals. It calculates the distance between each possible pair of points out of the two signals in terms of their associated feature values. These distances are used to build a cumulative distance matrix and to find the least expensive path through this matrix. This path represents the ideal warp, that is, the synchronization of the two signals that minimizes the feature distance between their synchronized points. For example, DTW has been used for gesture recognition in [49].
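The cumulative-distance formulation translates directly into a short dynamic-programming routine; this generic sketch uses Euclidean distances between feature vectors:

```python
import numpy as np

def dtw_distance(a, b):
    """DTW cost between two sequences of feature vectors a (n x d) and b (m x d)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])      # local feature distance
            D[i, j] = cost + min(D[i - 1, j],               # insertion
                                 D[i, j - 1],               # deletion
                                 D[i - 1, j - 1])           # match
    return D[n, m]
```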

Time delay neural networks

Time Delay Neural Networks (TDNN) can be considered an extension of the multilayer perceptron. A TDNN is based on time delays, which give individual neurons the ability to store the history of their input signals, so that the network can adapt to a sequence of patterns. Due to the concept of time delay, each neuron has access not only to the present input at time t_n, but also to the inputs at times t_1, t_2, ..., t_{n-1}. Therefore, each neuron can detect relationships between the current and the former input values, which might constitute a typical pattern in the input signal. Learning in a typical TDNN can be accomplished by standard back-propagation. TDNNs focus on working with continuous data, making the architecture adaptable to online networks and advantageous for real-time applications. In [50], hand gestures are recognized through a TDNN for the control of a multimedia application.

Finite state machine

A Finite State Machine (FSM) has a limited or finite number of possible states. Usually, the training of the model is done off-line, using many examples of each gesture as training data, from which the parameters of each state in the FSM are derived. The recognition of hand gestures can be performed online using the trained FSM [51]. When input data (feature vectors such as trajectories) are supplied to the gesture recognizer, the latter decides whether to stay at the current state of the FSM, or to jump to the next state based on the parameters of the input data. If it reaches a final state, we say that a gesture has been recognized.

2.4 Feature extraction techniques

Local image and video features have proven successful for many computer vision applications, such as image representation, object recognition [52] and matching [53], 3D scene reconstruction, and object tracking [54]. In this case, taking into account the aim of this project, we focus on feature descriptors for visible imagery in order to detect and recognize hand gestures.

During the last three decades, several local descriptors have been proposed to represent image regions and video volumes by exploiting their local appearance properties and structures, instead of using the images or video sequences as a whole. The simplest feature descriptor could be a vector formed by the values themselves of the pixels in a region of interest. However, this approach would not be very efficient due to the high resulting dimensionality. And this is even worse for video sequences, since the temporal domain has to be taken into account as well. Therefore, it is necessary to design descriptors with a small size (to be computationally efficient) that encode the most relevant information of the image/video regions from a classification viewpoint. In addition, a good descriptor must also have other desirable attributes:

• Discriminative. We expect that visually similar image regions should have similar descriptors, and that visually different image regions should have different descriptors.

• Invariant. We can expect that, despite a transformation (e.g., rotation), two visually similar image regions should still have similar descriptors.

• Robust. Visually similar image regions should have similar descriptors despite distortions (e.g., illumination changes).

Hence, feature descriptors represent, identify, and compress image information, capturing the most important and distinctive image characteristics. Finally, they are employed to compare different regions of interest by a similarity/dissimilarity function, in order to make a classification.

2.4.1 Image descriptors

Image feature descriptors analyze the appearance of an image region in the spatial domain. The shape and size of such regions depend on the nature of the description mechanisms. Usually, those regions are small sets of pixels of regular shape, such as rectangles, ellipses, or circles, from where the feature vector is extracted.

In this section, we are going to present the most relevant and successful descriptors, and those that have generated more variations, extensions, or modifications since their publication. However, it is worth mentioning that new mechanisms for image description are continuously appearing.

Scale Invariant Feature Transform (SIFT) and Histograms of Oriented Gradients (HOG) are two of the most popular descriptors, which are explained below. However, in the last few years, a new descriptor called Local Binary Patterns (LBP) has been achieving very good results, generating great interest due to its fundamental and specific attributes, and its large number of extensions. In addition, the LBP descriptor presents a high scalability in comparison to SIFT, HOG, and others. This fact is very important since the feature vector always has the same dimension independently of the shape and size of the image region, whereas SIFT and HOG need the image region to have a particular size in order to extract the desired features appropriately. For all these reasons, the LBP has been our starting point in the development of new feature descriptors, and it will be described extensively in the next sections.

2.4.1.1 Scale Invariant Feature Transform

The Scale Invariant Feature Transform (SIFT) algorithm was proposed in order to extract distinctive invariant features from images, and use them to carry out reliable matching between various views of an object or a scene [26] [1]. The goal is to create a descriptor for the image region that is compact, highly discriminative, and robust against illumination changes and 3D camera viewpoint. The feature descriptor used by SIFT is created by sampling the magnitudes and orientations of the image gradient in the region around a keypoint, and building smoothed orientation histograms to capture the important aspects of the region. The SIFT features are also well localized in both the spatial and frequency domains, reducing the probability of disruption by occlusion, clutter, or noise.

This algorithm consists of four major stages: (1) scale-space extrema detection, (2) keypoint localization, (3) orientation assignment, and (4) keypoint descriptor. In the first stage, potential interest points are identified by scanning the image over location and scale. This is implemented efficiently by constructing a Gaussian pyramid and searching for local maxima in a series of difference-of-Gaussian (DoG) images. In order to detect the local maxima or minima of the DoG images, every pixel in the DoG images is compared to its 8 neighbors in the current scale and its 9 neighbors in the scales above and below (see Figure 2.6).

In the second stage, an interpolation of nearby data is used to determine the position, scale, and ratio of principal curvatures of each candidate point, as shown in Figure 2.7. This


(a) Gaussian pyramid and difference-of-Gaussian (DoG) images. (b) Pixel marked with × is compared to its 26 neighbors in 3x3 regions at the current and adjacent scales in order to detect maxima and minima of the DoG images.

Figure 2.6: First step to compute the SIFT descriptor (extracted from [1]).

information allows points to be rejected if they have low contrast, or are poorly localized along an edge.

The third stage identifies the dominant orientation of the keypoint by computing a gradient orientation histogram in the neighbourhood of the keypoint, using the Gaussian image at the closest scale to the keypoint scale. The assigned orientation, scale, and location for each keypoint enable SIFT to construct a canonical view for the keypoint, which is invariant to similarity transformations. In the final stage, the feature descriptor is computed as an n × n array of orientation histograms with r orientation bins, which results in a feature vector with n^2 × r elements. Finally, this vector is normalized to improve invariance to variations in illumination. Figure 2.8 shows a 2 × 2 array of histograms with 8 orientations, computed from an 8 × 8 set of samples. Therefore, the feature vector for each keypoint will contain 2^2 × 8 = 32 components.

These two parameters, the number of orientations, r, and the width of the array of orientation histograms, n, can vary the complexity of the descriptor. As the complexity of the descriptor grows, it will be able to discriminate better in a large database, but it will also be more sensitive to shape distortions and occlusions.
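In practice, SIFT keypoints and descriptors are usually obtained from an existing implementation rather than re-implemented. A minimal sketch assuming OpenCV (cv2.SIFT_create is available in recent OpenCV builds) and a hypothetical input image:

    import cv2

    # Hypothetical input image containing a hand pose.
    image = cv2.imread('hand.png', cv2.IMREAD_GRAYSCALE)

    # Detect scale-space extrema and compute the standard 128-dimensional
    # descriptors (a 4x4 array of histograms with 8 orientation bins each).
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(image, None)

    print(len(keypoints), 'keypoints,', descriptors.shape, 'descriptor matrix')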


(a) Original image. (b) The initial 832 keypoints are displayed as vectors indicating scale, orientation, and location. (c) After applying a threshold on minimum contrast, 729 keypoints remain. (d) After applying a threshold on ratio of principal curvatures, 536 keypoints remain.

Figure 2.7: Stages of keypoint selection (extracted from [1]).

Figure 2.8: SIFT feature description. Left: image gradients. Right: keypoint descriptor (extracted from [1]).


2.4.1.2 Histogram of Oriented Gradients

Histograms of oriented gradients (HOG) can be used as feature descriptors for the purpose of object detection [25], where the occurrences of gradient orientations in localized parts of an image play an important role. This technique is similar to SIFT, but it differs in that it operates on a dense grid of uniformly spaced cells and uses local contrast normalization on overlapping blocks for improving the accuracy. The idea behind HOG is that the appearance and shape of local objects within an image can be well described by the distribution of intensity gradients and the votes of dominant edge directions. The feature descriptor is computed as follows.

Figure 2.9: The processing chain of HOG feature descriptor.

First, the image is divided into small contiguous regions of equal size, called cells, as shown in Figure 2.9, where the image gradients are computed. The second step consists of collecting a histogram of gradient directions (or edge orientations) from the pixels within each cell. Every pixel in the cell casts a weighted vote (contribution) for an edge orientation histogram based on the orientation and the gradient magnitude of the pixel. Finally, in order to obtain robustness against illumination and contrast changes, the gradient strengths must be locally normalized. This leads to grouping the cells into larger pixel regions called blocks. These blocks overlap with neighboring blocks, so that each cell can contribute to its orientation distribution more than once. The final HOG feature descriptor is then the vector containing the elements of the normalized block histograms from all of the block regions.
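A minimal sketch of this processing chain, assuming the hog function of scikit-image and a hypothetical input image (any equivalent implementation could be used):

    from skimage import io
    from skimage.feature import hog

    # Hypothetical input image of a hand region.
    image = io.imread('hand.png', as_gray=True)

    # 9-bin orientation histograms over 8x8-pixel cells, normalized over
    # overlapping 2x2-cell blocks (L2-Hys normalization).
    features = hog(image,
                   orientations=9,
                   pixels_per_cell=(8, 8),
                   cells_per_block=(2, 2),
                   block_norm='L2-Hys',
                   feature_vector=True)

    print(features.shape)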


2.4.1.3 Local Binary Patterns

Local Binary Pattern (LBP) has become a popular descriptor for several tasks, such as texture classification, face recognition, video background subtraction, and motion analysis. Two of its most important attributes are its robustness to dramatic illumination changes and its computational efficiency. In addition, it has proven to be highly discriminative. This operator calculates a binary pattern by thresholding the neighborhood of each pixel, and considering the result as a binary number. Then, a histogram is generated from all the computed binary numbers, which is used as the descriptor or feature vector of the image.

The LBP operator was originally designed for texture description [55]. Figure 2.10 summarizes its computation. This operator thresholds a 3x3 neighborhood by the intensity value of the center pixel, and therefore it only takes into account the sign of the differences between the values of the neighbors and the value of the center pixel. The thresholded values are concatenated into an 8-bit binary number, starting at the upper left corner, from left to right and from top to bottom, which represents the texture pattern of the considered neighborhood. This binary number could also be ordered in a different way. Next, all the extracted binary numbers are converted to decimal numbers, which represent texture labels. Finally, they are used to generate a histogram of 2^8 = 256 labels.

(a) 3x3 gray scale neighborhood. (b) Differences between the center pixel and its neighbors. (c) Thresholded neighborhood. The binary pattern is converted to decimal and represents one of the histogram's labels. (d) Histogram of LBP from the whole image.

Figure 2.10: Local binary pattern from a pixel neighborhood.
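A minimal NumPy sketch of this basic 3x3 LBP computation (illustrative only; the neighbor ordering below is one of several valid conventions):

    import numpy as np

    def lbp_histogram(image):
        """Basic LBP: threshold each 3x3 neighborhood by its center pixel and
        accumulate the resulting 8-bit codes into a 256-bin histogram."""
        img = image.astype(np.int32)
        h, w = img.shape
        # Neighbor offsets, clockwise from the upper-left corner.
        offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                   (1, 1), (1, 0), (1, -1), (0, -1)]
        codes = np.zeros((h - 2, w - 2), dtype=np.int32)
        center = img[1:-1, 1:-1]
        for bit, (dy, dx) in enumerate(offsets):
            neighbor = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
            codes |= (neighbor >= center).astype(np.int32) << bit
        hist, _ = np.histogram(codes, bins=256, range=(0, 256))
        return hist

    example = np.random.randint(0, 256, (64, 64))
    print(lbp_histogram(example).sum())   # equals the number of labeled pixels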

The invariance to monotonic gray-scale changes is achieved by the fact that LBP takes into account only the signs of the differences. In the presence of illumination variations, the pixel values can change dramatically, but their intensity relationships rarely change. Therefore, the sign is more reliable than the magnitude (absolute value) of the difference. This is the most important attribute that has popularized the LBP operator. An additional advantage is the relatively short size of the resulting feature vector, which makes it compatible with real-time classification. Hence, LBP achieves an excellent tradeoff between efficiency and computational cost.

On the other hand, the LBP operator does not provide any spatial information because it does not identify from what part of the image the binary patterns come. The generated histogram takes into account only the pattern occurrences, discarding the spatial information. This can be a disadvantage for some applications that could use the spatial information to be more discriminative.

Multi-scale Local Binary Patterns

In order to deal with textures at different scales, the LBP operator was extended to use neighborhoods of different sizes [2]. The new neighborhood pattern is defined as a set of sampling points evenly spaced on a circle centered at the pixel to be labeled. This new version of LBP allows any radius and number of sampling points, which enables a multiresolution processing that expands its applicability. In addition, it keeps all the properties previously described for the standard LBP.

The notation for defining this operator is LBP_{P,R}, where P means the number of sampling points on a circle of radius R. The mathematical expression to obtain a label from LBP_{P,R} is:

LBP_{P,R} = \sum_{p=0}^{P-1} s(g_p - g_c) \, 2^p,   (2.1)

where g_c corresponds to the gray value of the center pixel of the local neighborhood, g_p (p = 0, ..., P-1) corresponds to the gray values of the P equally spaced sampling points on the circular neighborhood, and s(x) is the sign function defined as:

s(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0 \end{cases}   (2.2)

The coordinates of g_p can be expressed in polar coordinates as (x_p, y_p) = (R cos(2πp/P), −R sin(2πp/P)). Figure 2.11 shows circularly symmetric neighbor sets for different numbers of sampling points and radii. Bilinear interpolation is used when a sampling point does not fall in the center of a pixel.
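A minimal sketch of the LBP_{P,R} operator, assuming scikit-image's local_binary_pattern function, whose default method samples P points on a circle of radius R with bilinear interpolation as in Equation 2.1:

    import numpy as np
    from skimage.feature import local_binary_pattern

    image = np.random.randint(0, 256, (64, 64)).astype(np.uint8)

    P, R = 8, 2  # sampling points and radius of the circular neighborhood
    codes = local_binary_pattern(image, P, R, method='default')

    # Histogram of the 2^P possible labels used as the feature vector.
    hist, _ = np.histogram(codes, bins=2 ** P, range=(0, 2 ** P))
    print(hist.shape)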

Uniform Local Binary Patterns

In spite of the excellent results that the LBP_{P,R} descriptor provides, using the whole range of possible LBP patterns may not be reliable to describe the image region


Figure 2.11: Circularly symmetric neighbor sets for different P and R (extracted from [2]).

under consideration, and may not achieve a good classification score. Uniform patterns [2] are based on the fact that some patterns happen more frequently than others. Uniform patterns represent fundamental image structures, such as edges, flat areas, spots, and corners, which are usually the dominant patterns among all possible ones. Thus, a histogram is generated taking into account only these uniform patterns, which are supposed to capture the most relevant information in an image region. Patterns that rarely happen can make the histogram sparse, tending to decrease the overall performance. On the other hand, considering only uniform patterns reduces the histogram size, which in turn decreases the computational cost. This attribute has made uniform patterns become very popular and used in many applications.

An LBP pattern is called uniform if the binary pattern contains, at most, two bitwise transitions from 0 to 1, or vice versa. For example, the patterns 00000000 (0 transitions) and 01110000 (2 transitions) are uniform, whereas the patterns 11001001 (4 transitions) and 01010010 (6 transitions) are not. LBP^{u2}_{P,R} is the notation used to refer to this operator, where u2 means uniform patterns with up to two transitions. The generated histogram has a separate bin for every uniform pattern, while all non-uniform patterns are assigned to a single bin. In this way, using a neighborhood with P=8, from 256 LBP patterns only 58 are uniform (see Figure 2.12), which results in a histogram with 59 different labels.

It has been proven that uniform patterns account for a bit less than 90% of all patterns when using a LBP^{u2}_{8,1} configuration, and for around 70% with a LBP^{u2}_{16,2} configuration.
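A minimal sketch of the uniformity test and of the resulting 59-bin labeling for P = 8 (illustrative only; the bin ordering is arbitrary):

    def transitions(code, P=8):
        """Number of 0->1 / 1->0 transitions in the circular binary pattern."""
        bits = [(code >> p) & 1 for p in range(P)]
        return sum(bits[p] != bits[(p + 1) % P] for p in range(P))

    def uniform_label_map(P=8):
        """Map each LBP code to a bin: one bin per uniform pattern, plus a
        single shared bin for all non-uniform patterns."""
        uniform_codes = [c for c in range(2 ** P) if transitions(c, P) <= 2]
        mapping = {c: i for i, c in enumerate(uniform_codes)}
        non_uniform_bin = len(uniform_codes)          # the 58 uniform codes come first
        return [mapping.get(c, non_uniform_bin) for c in range(2 ** P)]

    labels = uniform_label_map(8)
    print(len(set(labels)))   # 59 bins: 58 uniform patterns + 1 non-uniform bin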

Rotation invariant Local Binary Patterns

We know that the LBP_{P,R} operator produces 2^P different binary patterns considering the P sampling points in the neighborhood set. In the case that the image is rotated, the gray values of the neighborhood, g_p, will move in the corresponding direction along the circumference of radius R and center g_c. Hence, rotating a particular LBP pattern naturally results in a different LBP pattern. This does not apply to patterns


Figure 2.12: The 58 different uniform patterns with a LBP_{8,R} configuration (extracted from [3]).


00000000 and 11111111, which remain constant at all rotation angles. In this way, if a fixed LBP pattern is assigned to a group of LBP patterns that result in the same LBP under some binary shifting operation, rotation invariance is achieved [2]. Therefore, in order to remove the effect of rotation, the LBP^{ri}_{P,R} operator is defined as follows:

LBP^{ri}_{P,R} = \min\{ ROR(LBP_{P,R}, i) \mid i = 0, 1, ..., P-1 \},   (2.3)

where ROR(x, i) performs a circular bitwise right shift on the P-bit number x, i times. In terms of image pixels, it simply corresponds to rotating the circular neighborhood clockwise i times. As a result, the LBP^{ri}_{P,R} operator quantifies the occurrences of individual rotation invariant patterns corresponding to certain features in the image.

There are 36 rotation invariant local binary patterns that can occur in the case of P=8, which are shown in Figure 2.13. Any binary pattern obtained from a LBP_{8,R} operator can be represented as one of the 36 mentioned patterns after shifting its bits appropriately.

Figure 2.13: The 36 different rotation invariant patterns with a LBP_{8,R} configuration (extracted from [2]).
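A minimal sketch of Equation 2.3, mapping every code to its minimum value over all circular bit rotations (for P = 8, exactly 36 distinct codes remain):

    def rotation_invariant_code(code, P=8):
        """Minimum value of the code over all circular right shifts (Eq. 2.3)."""
        best = code
        for _ in range(1, P):
            # ROR by one position on a P-bit number.
            code = ((code >> 1) | ((code & 1) << (P - 1))) & ((1 << P) - 1)
            best = min(best, code)
        return best

    # Number of distinct rotation invariant patterns for P = 8.
    distinct = {rotation_invariant_code(c, 8) for c in range(256)}
    print(len(distinct))   # 36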

Dominant Local Binary Patterns

There are some cases in which uniform patterns are not the dominant patterns. In an image mostly consisting of straight edges or low-curvature edges, the LBP^{u2}_{8,R} operator is able to capture the fundamental information, but when an image is formed by complicated structures, shapes, and irregular edges, it is necessary to increase the number of sampling points in the neighborhood to capture the information appropriately. In this situation, the number of transitions in the binary number grows with the number of neighbors, which reduces the proportion of uniform patterns among all LBP patterns. As a result, they are not suitable for representing the image.

However, there can still be dominant patterns in a specific image that can be used for image representation, because they are reliable, robust, and highly discriminative. In order to develop this idea, Dominant Local Binary Patterns (DLBP) were proposed [56]. This operator extracts image features for texture classification making use of the most frequently occurring patterns, which are considered to cover about 80% of all pattern occurrences. It calculates the occurrences of all rotation invariant patterns and sorts them in descending order. The first ones should contain the dominant structures in the image, which are the dominant patterns. These patterns are more prone to represent the relevant information in the image.

It should be noted that the DLBP descriptor discards the information regarding the dominant pattern types; it takes into account only the information about the pattern occurrences. According to the experimental results, omitting this information is not damaging, and it allows considering a non-fixed set of patterns.

Complete Local Binary Patterns

Complete Local Binary Patterns (CLBP) were proposed to generalize and complete the LBP descriptor, by considering the center pixel and a local difference sign-magnitude transform (LDSMT) [4]. On the one hand, the center pixel is thresholded and coded by a binary code, resulting in a binary map which is named CLBP-Center (CLBP_C). On the other hand, the LDSMT decomposes the local differences into two complementary components, the sign and the magnitude. Two operators are proposed to code them, denoted by CLBP-Sign (CLBP_S) and CLBP-Magnitude (CLBP_M). The conventional LBP is equivalent to CLBP_S, which preserves more information about the local structure than CLBP_M, and can extract the texture features reasonably well. However, the magnitude component and the intensity value of the center pixel may contribute additional and useful information as well. Hence, a significant improvement can be made for texture classification by combining the CLBP_S, CLBP_M, and CLBP_C features.

Figure 2.14: Framework of Complete Local Binary Patterns (extracted from [4]).


Given a center pixel g_c and its P sampling points g_p, evenly spaced on the circular neighborhood, the differences between g_p and g_c are calculated as d_p = g_p − g_c. The local difference vector [d_0, d_1, ..., d_{P−1}] represents the local structure of the region centered at g_c. Each d_p can be further decomposed into two components:

d_p = s_p \cdot m_p, \quad \text{with} \quad s_p = sign(d_p), \quad m_p = |d_p|,   (2.4)

where sign(x) is the same function as the one used in Equation 2.2, and m_p is the absolute value of d_p. Thus, the local difference vector is transformed into a sign vector [s_0, s_1, ..., s_{P−1}] and a magnitude vector [m_0, m_1, ..., m_{P−1}]. Both of them are complementary and the original difference vector can be perfectly reconstructed from them.

Since the magnitude vector is made of continuous values instead of binary values, it has to be coded in a consistent format using the CLBP_M operator:

CLBP_M_{P,R} = \sum_{p=0}^{P-1} t(m_p, c) \, 2^p,   (2.5)

where c is a threshold to be determined adaptively, and the function t(x, c) is defined as:

t(x, c) = \begin{cases} 1, & x \geq c \\ 0, & x < c \end{cases}   (2.6)

In order to make CLBP_C consistent with CLBP_S and CLBP_M, it is coded by a binary code as:

CLBP_C_{P,R} = t(g_c, c_I),   (2.7)

where c_I is a threshold that can be set as the average gray level of the whole image.

Finally, the three operators can be combined in two ways: jointly or hybridly. Following the first option, a 3D histogram can be built, denoted by CLBP_S/M/C. In this case, the representation is highly discriminative, but it is not very compact. The second choice is to first build a 2D histogram (CLBP_S/C or CLBP_M/C), which is later converted to a 1D histogram. Finally, it is concatenated with the other one (CLBP_M or CLBP_S) to generate a joint histogram denoted by CLBP_M_S/C or CLBP_S_M/C. In this other case, the representation is more compact, but less discriminative.
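A minimal sketch of the three CLBP codes for a single circular neighborhood (illustrative only; a complete implementation would also handle the sub-pixel sampling and build the joint histograms):

    import numpy as np

    def clbp_codes(center, neighbors, c, c_I):
        """CLBP_S, CLBP_M and CLBP_C codes for one neighborhood.
        center: gray value g_c; neighbors: the P sampled gray values g_p;
        c: adaptive magnitude threshold; c_I: global intensity threshold."""
        d = neighbors - center                # local differences d_p
        s = (d >= 0).astype(np.uint8)         # sign component (CLBP_S behaves like LBP)
        m = np.abs(d)                         # magnitude component
        clbp_s = sum(int(s[p]) << p for p in range(len(neighbors)))
        clbp_m = sum(int(m[p] >= c) << p for p in range(len(neighbors)))
        clbp_c = int(center >= c_I)           # coded center pixel
        return clbp_s, clbp_m, clbp_c

    neighbors = np.array([12, 200, 35, 90, 90, 14, 7, 180], dtype=np.int32)
    print(clbp_codes(100, neighbors, c=40, c_I=128))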

2.4.2 Video descriptors

Video descriptors have been designed to extend the feature analysis from the spatial domain to the spatio-temporal domain. Hence, appearance and motion can be combined. In addition, they should be relatively independent of spatio-temporal shifts and scales, and robust to background clutter and multiple motions in the scene.


Various matching-based methods have been proposed for model learning by considering invariant visual representations of example gestures. Among those visual representations, local spatio-temporal features [57] are the most widely exploited; however, most of them are used in human action recognition [58] [59]. Using these techniques for hand gesture recognition is not suitable because the durations of gestures are much shorter than those of human activities, and only a limited spatio-temporal range of features can be extracted. This causes their effectiveness to be accordingly degraded. Furthermore, the extraction of these features is generally slow, and they do not offer a scalable solution for efficient matching when the database is large. Other descriptors include motion trajectories [60], spatio-temporal gradients [61], and global histograms of optical flow [62]. However, the comparison of existing methods is often limited given the different range of experimental settings used.

Since no video descriptor has been used explicitly for hand gesture recognition, we continue focusing on the LBP descriptor and describe its extensions for the spatio-temporal domain in order to compare our approach with them.

2.4.2.1 Volume Local Binary Patterns and LBP from Three Orthogonal Planes

In order to extend texture analysis from the spatial domain to the spatio-temporal domain, Volume Local Binary Patterns (VLBP) and Local Binary Patterns from Three Orthogonal Planes (LBP-TOP) were proposed [5]. These extensions of the LBP operator are used to describe dynamic textures, so that they are able to combine motion features with appearance features.

Volume Local Binary Patterns

The idea behind VLBP is the same as the LBP_{P,R} operator, except that it is also extended to the previous and posterior neighboring frames. Given a pixel belonging to a specific frame, its local volume neighborhood is formed by its P spatial sampling points in the same frame (as in LBP_{P,R}), the center pixel in some previous frame and its P sampling points, and the center pixel in the posterior frame and its P sampling points, obtaining a total of 3P+2 neighbors. Thus, the VLBP descriptor uses three parallel planes, of which only the middle one contains the center pixel. Once the gray value of the center pixel is subtracted from the gray values of the circularly symmetric neighborhood, and the sign of these differences is considered, we obtain a dynamic texture as:

v = (v_0, v_1, ..., v_{3P+1})
  = ( s(g_{t_c-L,c} − g_{t_c,c}), s(g_{t_c-L,0} − g_{t_c,c}), ..., s(g_{t_c-L,P−1} − g_{t_c,c}),
      s(g_{t_c,0} − g_{t_c,c}), ..., s(g_{t_c,P−1} − g_{t_c,c}),
      s(g_{t_c+L,0} − g_{t_c,c}), ..., s(g_{t_c+L,P−1} − g_{t_c,c}), s(g_{t_c+L,c} − g_{t_c,c}) ),   (2.8)


where s(x) is the same function as the one used in (2.2), g_{t_c,c} corresponds to the gray value of the center pixel of the local volume neighborhood, g_{t_c−L,c} and g_{t_c+L,c} correspond to the gray values of the center pixel in the previous and posterior neighboring frames with time interval L, and g_{t,p} (t = t_c−L, t_c, t_c+L; p = 0, ..., P−1) correspond to the gray values of the P neighbors in frame t. Finally, v is transformed into a unique VLBP_{L,P,R} number that represents the spatial structure of the local volumetric dynamic texture:

VLBP_{L,P,R} = \sum_{q=0}^{3P+1} v_q \, 2^q.   (2.9)

Figure 2.15 shows in detail how to extract a volume local binary pattern. The final histogram has 2^{3P+2} labels, where P is the parameter that determines the number of features. If P is large, the histogram will be very long, and this will restrict its applicability. However, if P is small, the feature vector loses more information, but in exchange it is more compact.

Figure 2.15: Procedure to compute VLBP_{1,4,1}.


Local Binary Patterns from Three Orthogonal Planes

In order to solve the dimensionality problems of the VLBP descriptor, the LBP-TOP descriptor proposes to concatenate LBP histograms from three orthogonal planes: XY, XT, and YT, as shown in Figure 2.16. The XY plane represents appearance information, while the XT plane gives a visual impression of one row changing in time, and the YT plane describes the motion of one column in temporal space.

Figure 2.16: Procedure to compute the LBP-TOP_{P_XY, P_XT, P_YT, R_X, R_Y, R_T} descriptor.

Thus, LBP-TOP uses three orthogonal planes that intersect at the center pixel. In order to extract an LBP histogram from each plane, a set of parameters has to be fixed. These parameters are the number of sampling points in the circular neighborhood of each plane, and the radius of the circular neighborhood in each plane. Once the three histograms are computed, they are concatenated to form the feature vector. Therefore, the resulting descriptor, denoted by LBP-TOP_{P_XY, P_XT, P_YT, R_X, R_Y, R_T}, will contain 2^{P_XY} + 2^{P_XT} + 2^{P_YT} labels. If P_XY = P_XT = P_YT = P, we can see that the LBP-TOP descriptor will have 3 · 2^P labels, instead of the 2^{3P+2} labels of the VLBP descriptor. Notice that the dimension of the feature vector is considerably reduced. On the other hand, the radius in the time axis will depend on the frame rate of the analysed sequence, as in the VLBP descriptor. Figure 2.17 shows the different considered neighborhoods.


(a) Intersection of the three orthogonal planes. (b) Neighborhood in the XY plane. (c) Neighborhood in the XT plane. (d) Neighborhood in the YT plane.

Figure 2.17: Volumetric neighborhood of a pixel considering a LBP-TOP_{16,8,8,3,3,1} configuration (extracted from [5]).
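A minimal sketch of the LBP-TOP idea for a small video volume, assuming scikit-image's local_binary_pattern; for brevity only the central slice of each plane is used, whereas the full descriptor accumulates the histograms over the whole volume:

    import numpy as np
    from skimage.feature import local_binary_pattern

    def lbp_top(volume, P=8, R=1):
        """Concatenated LBP histograms from the XY, XT and YT planes of a
        T x Y x X video volume (central slice of each plane, for brevity)."""
        T, Y, X = volume.shape
        planes = [volume[T // 2, :, :],    # XY plane: appearance
                  volume[:, Y // 2, :],    # XT plane: one row changing in time
                  volume[:, :, X // 2]]    # YT plane: one column changing in time
        hists = []
        for plane in planes:
            codes = local_binary_pattern(plane, P, R, method='default')
            h, _ = np.histogram(codes, bins=2 ** P, range=(0, 2 ** P))
            hists.append(h)
        return np.concatenate(hists)       # 3 * 2^P components

    volume = np.random.randint(0, 256, (20, 48, 48)).astype(np.uint8)
    print(lbp_top(volume).shape)   # (768,) for P = 8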


Chapter 3

Description of the hand-gesture recognition system

3.1 System overview

A robust vision-based hand-gesture recognition system has been developed in order to provide a more natural and intuitive Human-Computer Interaction (HCI) interface. In particular, we present a framework that simulates a mouse-like pointing device in order to replace the mouse device, and to interact with the computer by performing hand gestures. In addition to controlling the mouse device, this framework can be extended to control multimedia applications [63], video games [64], and medical systems [65].

For this purpose, a gesture vocabulary has been created. Since this application tries to imitate a mouse device, we propose five gestures in order to simulate five different functions: mouse activation, the movement of a cursor, left click, right click, and mouse deactivation. Each one of the gestures will be described in Section 5.1. The different mouse functions are triggered depending on the recognized gesture.

Our recognition system is based on machine learning techniques. In particular, we employ SVM classifiers [45] because they work very well with high-dimensional data, and are capable of delivering high performance in terms of classification accuracy. Furthermore, one of our goals is to integrate image and video descriptors as part of the recognition task. In this case, the most common and appropriate tool to be used together with feature extraction techniques is a general classifier or a template matcher, as we saw in the previous chapter.

In order to recognize hand gestures from video sequences, we propose an approach composed of three stages: detection, tracking, and recognition. The strategy implemented in the detection and recognition phases consists of using a set of SVM classifiers together with some feature extraction technique. The detection phase uses


an SVM classifier to detect certain static hand poses, whose input is a set of features extracted by some image descriptor. The detections are used as input of a multiple object tracker in order to generate a trajectory of hand poses. Finally, the recognition phase uses the tracked hand regions to compute spatio-temporal features, which are delivered to a bank of SVM classifiers that perform the gesture recognition.

The image and video descriptors that have been developed in this project are based on color imagery, and therefore the chosen model to represent the hand gestures is an appearance-based model. We have chosen a color-based monocular camera in order to capture the hand gestures in the scene. Almost all computers include one because of their usefulness in multimedia applications and their low price.

The system has the advantage of being highly modularized. Each one of the three phases is a section separated from the others, which allows an easier integration and localization of possible errors. In turn, every phase is also modularized. From an implementation viewpoint, they have a global structure that allows configuring their parameters and other internal modules easily. The system overview can be seen in Figure 3.1.

Figure 3.1: Block diagram of the proposed hand-gesture recognition system.

3.2 Hand detection

The aim of the detection stage is to detect the hand poses and transitions that can be part of the considered dynamic hand gestures. The detected hand poses will be used as input of the tracking phase. The detection task is based on a machine learning technique, and on an image descriptor for feature extraction. The overview of this phase is shown in Figure 3.2. In this case, we employ a binary SVM classifier since we only deal with two classes. In order to achieve a higher performance, a Hellinger kernel, more commonly known as the Bhattacharyya coefficient [66], has been used. It allows learning non-linear decision boundaries by projecting the features into a higher dimensional space, where linear boundaries can be computed to separate the different classes. It can be mathematically represented as:

k(f, f') = \sum_i \sqrt{f(i) f'(i)},   (3.1)

where f and f' are normalized histograms (the used features are based on histograms).
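A minimal sketch of this kernel used with a binary SVM, assuming scikit-learn and hypothetical non-negative, L1-normalized histogram features (the callable below returns the Gram matrix of Equation 3.1):

    import numpy as np
    from sklearn.svm import SVC

    def hellinger_kernel(F, G):
        """Gram matrix of k(f, f') = sum_i sqrt(f_i * f'_i) for histogram rows."""
        return np.sqrt(F) @ np.sqrt(G).T

    # Hypothetical normalized histogram descriptors and hand/background labels.
    X_train = np.random.rand(200, 59)
    X_train /= X_train.sum(axis=1, keepdims=True)
    y_train = np.random.randint(0, 2, 200)

    clf = SVC(kernel=hellinger_kernel)
    clf.fit(X_train, y_train)

    X_test = np.random.rand(3, 59)
    X_test /= X_test.sum(axis=1, keepdims=True)
    print(clf.predict(X_test))

An equivalent trick is to take the element-wise square root of the histogram features and use a plain linear kernel, since the dot product of the transformed vectors equals Equation 3.1.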

On the other hand, the feature extraction is carried out by the S-LBP descriptor, a new descriptor designed by us, which will be presented in Section 4.2. It computes feature vectors which are robust against dramatic illumination changes and slight image translations, and are computationally efficient. They also include spatial information in a compact way that makes them highly discriminative with respect to other approaches.

Figure 3.2: Block diagram of the hand detection phase.

To be able to detect hand poses at different spatial locations and scales, multiple sliding windows of different sizes are used in each frame. The sliding windows are overlapped, where the overlapping magnitude between consecutive windows is determined by the spatial step of the sliding window, which is a design parameter. The choice of this parameter is a trade-off between the computational cost and the accuracy of the detection. The smaller the spatial step, the larger the number of windows, and therefore the higher the computational cost. This strategy is wholly generic and behaves very well in practice.

On the one hand, in order to simplify the analysis in the detection phase, we only consider a unique aspect ratio for the sliding window, which is determined by the morphology of the hand poses. We select a rectangular bounding box because it is a simple shape and can be adapted to every hand pose better than a square. The aspect ratio of the sliding window is another design parameter.

There is a dependency among the size of the window, the spatial step, and the classifier. For example, if we work with windows that tightly surround the object, then we might use a more accurately trained classifier. However, we would have to use smaller spatial steps, having to process more windows. If we use windows that are rather larger than the object along with a bigger spatial step, we will have fewer windows to process, but our ability to detect and localize objects will decrease. In our case, we use windows that tightly surround the object for training the classifier, along with a small spatial step.


On the other hand, we carry out a multi-scale analysis by generating a multiresolution pyramid, which contains different scales of the frame that is being processed. The system allows setting the number of scales, and the minimum and maximum scales, which are design parameters for this step. Hence, we slide a fixed window along all of the scales in the multiresolution pyramid. This strategy allows dealing with hand poses of different sizes. We have chosen this option, instead of sliding several windows of different sizes over a single scale, because the computational cost is lower and the detector is easier to train.

In this way, a feature vector is extracted from each window position (while it goes through the multiresolution pyramid), and tested by the binary SVM classifier, determining whether the region contains a hand pose or not. As a result, every window is labeled as a hand pose or background. Figure 3.3 shows the aforementioned strategy for hand detection.
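A minimal sketch of this multi-scale scanning strategy, assuming scikit-image's pyramid_gaussian for the multiresolution pyramid, a hypothetical feature extractor (describe) and a trained binary classifier with a decision_function method:

    from skimage.transform import pyramid_gaussian

    def detect_hands(frame, classifier, describe, win=(64, 48), step=8,
                     downscale=1.25, max_layer=4):
        """Slide a fixed window over every level of a multiresolution pyramid.
        frame is a grayscale image; win is (height, width) of the window."""
        detections = []
        for level, scaled in enumerate(pyramid_gaussian(frame, downscale=downscale,
                                                        max_layer=max_layer)):
            scale = downscale ** level
            H, W = scaled.shape[:2]
            for y in range(0, H - win[0] + 1, step):
                for x in range(0, W - win[1] + 1, step):
                    window = scaled[y:y + win[0], x:x + win[1]]
                    score = classifier.decision_function([describe(window)])[0]
                    if score > 0:   # labeled as a hand pose
                        # Map the bounding box back to the original frame.
                        detections.append((x * scale, y * scale,
                                           win[1] * scale, win[0] * scale, score))
        return detections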

However, several overlapping windows belonging to the same pyramid level could be labeled as positive samples, since all of them could contain a significant fraction of the hand pose. This causes the same hand pose to be counted multiple times. This effect cannot be easily eliminated by considering a bigger training database with the purpose that the classifier only responds when the hand is exactly centered in the window. In addition to this fact, it is a complex task that would require a large amount of memory. Therefore, the usual strategy for managing this problem is to use a non-maxima suppression technique, which only selects those potential hand regions presenting locally maximal scores [67]. In our system, this technique selects a determined number of maxima according to two design parameters: a threshold and a radius [68].

The windows resulting after the non-maxima suppression are potential hand regions. However, the problem persists among the different scales of the multiresolution pyramid. For this reason, we apply another non-maxima suppression technique that reduces the number of overlapped windows according to an overlapping threshold, which is a design parameter. If the overlapping of two windows is bigger than the overlapping threshold, the algorithm only selects the one with the highest score.
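A minimal sketch of an overlap-based non-maxima suppression of this kind (a greedy variant for illustration; the exact threshold/radius scheme of [67][68] is not reproduced here):

    def non_maxima_suppression(detections, overlap_threshold=0.5):
        """Greedy NMS: keep the highest-scoring window and discard any window
        whose overlap (intersection over union) with it exceeds the threshold.
        Each detection is a tuple (x, y, w, h, score)."""
        def iou(a, b):
            ax2, ay2 = a[0] + a[2], a[1] + a[3]
            bx2, by2 = b[0] + b[2], b[1] + b[3]
            iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
            ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
            inter = iw * ih
            union = a[2] * a[3] + b[2] * b[3] - inter
            return inter / union if union > 0 else 0.0

        remaining = sorted(detections, key=lambda d: d[4], reverse=True)
        kept = []
        while remaining:
            best = remaining.pop(0)
            kept.append(best)
            remaining = [d for d in remaining if iou(best, d) <= overlap_threshold]
        return kept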

Once we obtain a set of filtered detections in a frame, they are used as input of the next stage, the tracking phase. Every detection can be seen as a vector which contains the coordinates of a reference point in the frame (in this case, the upper left corner of the bounding box), the height and width of the bounding box, and the obtained score.


Figure 3.3: Stages in the detection phase.


3.3 Hand tracking

The goal of the tracking phase is to estimate temporal hand trajectories from the detected hand poses at every time step. As a result, one or several trajectories are generated depending on the number of detected hands in every frame. These trajectories can be seen as a volume of hand regions, which will be analyzed in the recognition phase. Hand tracking is a mandatory step for our gesture recognition system, because it is necessary to localize the target hand in every frame to compute the video descriptors for the gesture recognition.

Since there can be missing detections due to occlusions and strong changes in the hand appearance, and also false detections generated by background structures, the estimation of the hand locations can be inaccurate. The hand detection identities can be interchanged as well, due to erroneous associations between detections and trajectories. On the other hand, the computational cost of the tracking could inevitably grow exponentially with the number of objects. In order to deal with these problems, we use a multiple object tracker which is robust to erroneous, distorted, and missing detections [69]. In this case, besides improving the accuracy, the computational cost is reduced since the complexity becomes linear instead of exponential in the number of objects.

The general tracking procedure is as follows. When the first frame is processed in the detection phase, the obtained bounding boxes are used as input detections of the tracker, which will create as many trajectories as the number of detected hands in the frame. Every time a frame is processed, the bounding boxes of the detected hands are associated with their corresponding trajectories. Hence, a trajectory can be seen as a buffer where every component contains the hand location for a specific time step.

Since the global system has been designed to recognize gestures performed with one hand, only one trajectory should be created. However, more than one trajectory can appear due to false detections. These false detections should not persist along the entire video sequence, producing very short trajectories. In this case, the tracker also allows us to remove this type of trajectories by imposing a threshold on the minimum trajectory length. However, if this problem persists, it will also be handled by the recognition phase.

3.4 Gesture recognition

The goal of the recognition stage is to temporally segment the video sequences using the information of the trajectories provided by the previous phase, and to recognize the dynamic hand gestures that are executed in them. Every video sequence contains a series of spatially segmented hand poses, which can execute one or more dynamic hand gestures.

The recognition task consists of using a machine learning technique together with a video descriptor for feature extraction, as we can see in Figure 3.4. In this case, we use a set of five SVM classifiers since we have to deal with five classes. We use the same kernel as in the detection phase (see Equation 3.1) to learn non-linear decision boundaries. On the other hand, the feature extraction in the spatio-temporal domain is carried out by the VD-TS video descriptor, a novel proposal that will be described in Section 4.3. The VD-TS descriptor is robust against dramatic illumination changes, and allows dealing with variations in the execution of the hand gestures. In addition, it contains spatio-temporal information in an efficient and compact way that makes it highly discriminative with respect to other video descriptors.

Figure 3.4: Block diagram of the hand-gesture recognition phase.

There are two main advantages in using a general classifier together with video feature extraction techniques in the recognition phase, in comparison to other recognition techniques, such as automata-based models [48] [49]. On the one hand, its computational complexity is lower since it is not so dependent on the number of gestures to be recognized. On the other hand, if a new gesture needs to be recognized, it is only necessary to update the training database, instead of having to modify the whole gesture model.

In order to determine where a specific dynamic hand gesture starts and ends, multiple sliding temporal windows scan the video sequence. The sliding windows are overlapped. We consider that the overlapping magnitude between consecutive windows is determined by the temporal step of the sliding window, which is a design parameter. Figure 3.5 shows this strategy, in which the sliding temporal window scans the segmented video sequence extracted from the tracking phase. The choice of the temporal step is a trade-off between the computational cost and the accuracy of the recognition. The smaller the temporal step, the larger the number of windows, so the computational cost increases. However, the recognition accuracy is higher as well, since the temporal windows can tightly enclose the dynamic hand gesture.

The size of the temporal sliding window has been fixed taking into account the average length of the different dynamic hand gestures. Section 5.1.3 describes the exact estimation procedure.


Figure 3.5: Sliding window approach to segment dynamic hand gestures.


Since the video sequence is processed frame by frame, we cannot apply the temporal window until the number of hand poses in a trajectory is at least equal to the size of the temporal window. Once a trajectory contains enough hand regions to be processed, we compute spatio-temporal features called Volumetric Descriptors based on Temporal Sub-sampling (VD-TS), which will be presented in detail in Section 4.3. Several different feature vectors are computed from the same temporal window by applying this descriptor several times, taking advantage of its random temporal subsampling scheme. Thus, every computation of the video descriptor produces different feature vectors; however, they are correlated, that is, they belong to the same cluster in the feature space. This mechanism increases the recognition accuracy since it reproduces different slight variations in the hand gesture execution.

As a result, we have a set of feature vectors that are all associated with the same temporal window. Then, each one of them is individually classified as belonging to a specific class. Finally, using a voting scheme, the gesture is labeled as the most voted class, as shown in Figure 3.6. In this way, we can choose the label that achieves the best representation of the hand gesture in order to avoid misclassifications. This process is repeated for each trajectory (which defines an underlying segmented video sequence) providing enough hand region samples to fill the temporal window.
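A minimal sketch of this voting scheme over the feature vectors of one temporal window, assuming a hypothetical bank of binary classifiers (one per gesture) with a decision_function method:

    from collections import Counter

    def classify_window(feature_vectors, classifiers):
        """Label a temporal window by majority voting.
        classifiers: hypothetical dict {gesture_name: binary SVM}; each feature
        vector votes for the gesture whose classifier gives the highest score."""
        votes = []
        for f in feature_vectors:
            scores = {g: clf.decision_function([f])[0] for g, clf in classifiers.items()}
            votes.append(max(scores, key=scores.get))
        label, count = Counter(votes).most_common(1)[0]
        return label, count / len(votes)   # predicted gesture and its vote share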

The final step is a temporal validation of the predicted gesture (see Figure 3.7). We impose the condition that the same prediction should be consistent over a determined number of consecutive windows. The reason is that, if the step size is small enough, the windows will differ only in a few frames and will contain the same gesture. This strategy solves potential errors due to gesture transitions and erroneous trajectories. This temporal validation window also helps to filter erroneously recognized gestures that are not consistent in time (produced by erroneous trajectories). This strategy also alleviates the use of a background class (or rejection class), which is harder to learn since there is a larger variability in the training samples for this class.

Therefore, in order to recognize a specific hand gesture, its duration must be at least slightly shorter than the window size, that is, the temporal window must contain a significant number of hand poses that represent the hand gesture.

3.5 System training

Both the detection and recognition phases have to be trained to estimate the optimal parameters for the classifiers.

In the training stage of the detection phase, we estimate the optimal parameters of a binary SVM classifier. The positive training samples are hand poses and transitions that can be part of the different dynamic hand gestures considered for this


Figure 3.6: Feature extraction and gesture recognition for a given temporal window.


Figure 3.7: Sliding window approach to validate a prediction.


application (hand poses class). On the other hand, the negative training samples are images containing background, and other types of gestures that do not belong to the considered gestures (background class).

In the training stage of the recognition phase, we estimate the optimal parameters of five SVM classifiers, each one focusing on a different dynamic hand gesture (the classes in the database). The training samples are sequences that contain the five considered dynamic hand gestures. In order to generate a larger number of training samples, we apply the video descriptor to every training sequence several times, taking advantage of the random temporal subsampling scheme of the video descriptor (see Section 4.3). Every computation of the video descriptor produces different feature vectors that should be strongly correlated, that is, they should belong to the same cluster in the feature space. Considering this fact, we generate n different feature vectors from every sequence in the database.

Once all the feature vectors are extracted, we train the SVM classifiers following a one-vs-all strategy. It consists of training each of the classifiers considering as positive samples all the feature vectors extracted from the class being trained, and as negative samples the feature vectors corresponding to the rest of the classes. Therefore, we obtain five classifier models, which will be used in the recognition phase in order to identify the performed gesture.
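A minimal sketch of this one-vs-all training, assuming scikit-learn and hypothetical VD-TS feature vectors with their gesture labels (the gesture names below are placeholders):

    import numpy as np
    from sklearn.svm import SVC

    GESTURES = ['ok', 'fist', 'move', 'left_click', 'right_click']  # hypothetical names

    def train_one_vs_all(features, labels):
        """Train one binary SVM per gesture: positives are the feature vectors of
        that gesture, negatives are the feature vectors of all other gestures."""
        models = {}
        for gesture in GESTURES:
            y = np.array([1 if lbl == gesture else 0 for lbl in labels])
            clf = SVC(kernel='rbf')   # the system described here uses the Hellinger kernel
            clf.fit(features, y)
            models[gesture] = clf
        return models

    # Hypothetical training set of VD-TS feature vectors and their labels.
    X = np.random.rand(250, 128)
    y = np.random.choice(GESTURES, 250)
    models = train_one_vs_all(X, y)
    print(list(models))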

3.6 Virtual mouse application

Every time that a determined gesture is recognized after the recognition phase, the application triggers the associated mouse function. The flow chart in Figure 3.8 describes the behaviour of the application. In order to initialize it, the system must recognize an "ok" symbol. When the mouse mode is activated after the "ok", the system executes different actions depending on the recognized gesture. If the gesture is recognized as the "fist" class, the application ends the mouse mode and the system does not execute any action until a new "ok" symbol is recognized.


Figure 3.8: Execution of the mouse functions based on the gesture predictions.


Chapter 4

Image and video descriptors

4.1 Introduction

One of the main contributions of this thesis is the design of advanced image and video descriptors to achieve reliable and compact representations from images and sequences containing hand gestures. Regarding image descriptors, we have drawn inspiration from Local Binary Patterns, since they present very important and powerful characteristics. However, the fact that they do not consider any spatial information is a disadvantage in the case of hand gestures. Therefore, we have designed and implemented two feature descriptors inspired by LBP that contain additional global spatial information in order to be more discriminative. As for video descriptors, we have developed a new approach that efficiently embeds spatial and temporal information, improving on other existing approaches in terms of discrimination power and descriptor compactness. Some of the concepts used in the implementation are inherited from the previous design of the LBP-inspired image descriptors. All of them will be explained in the following sections.

4.2 Image descriptors

In addition to being highly discriminative and computationally efficient, the feature vectors extracted from the appearance of a hand must present invariance against dramatic illumination changes and slight translations. In this case, it is important that the feature vector is not invariant against rotations, because a specific hand gesture can have a very different meaning depending on its rotational position.

The LBP descriptor [55] has become very popular, giving rise to many new extensions [2] (see Section 2.4.1.3). One of its most interesting properties is its scalability, as we mentioned in Section 2.4. If we make a further study of the applications where this descriptor has been successful, we can find texture recognition


[70], face detection/recognition [71], and facial expression recognition [72] as the most popular ones. In the case of textures, they can be seen as a set of patterns that recur several times. As for faces, most of them are formed by a uniform surface (forehead skin, cheeks, and chin), and by four patterns that do not greatly change their relative positions (two eyes, a nose, and a mouth), although every element is different among different people. In all these cases, the global spatial information is not very determining, since the appearance can only change in a limited and controlled way. Thus, the most important consideration is to know what type of patterns form the image, and how many times they appear (i.e., the local spatial information).

However, in the case of a hand pose, we cannot expect a set of patterns with fixed global locations, due to the fact that a hand is a deformable object with more than 25 degrees of freedom. In addition, the appearance can change largely depending on the observer's viewpoint. For this reason, frontal face detectors achieve higher detection rates than hand detectors. The hand patterns neither spread out uniformly as in textures, nor are they located in specific areas of the image as in faces. For hand poses, knowing what part of the image the patterns come from is as important as knowing the type of patterns and the number of times they appear. Therefore, our goal is to design a major extension of the LBP descriptor that contains spatial information. For this purpose, we propose two variants that are explained below.

4.2.1 Spatiograms of Local Binary Patterns

Spatiograms of Local Binary Patterns (S-LBP) are a major extension of Local Binary Patterns that includes global spatial information to be more discriminative. The idea behind the S-LBP descriptor is to know from what part of the image every local binary pattern comes. To that end, we propose an algorithm that can be divided into three steps (see Figure 4.1).

The first step consists of computing the Multi-scale LBP descriptor (LBP_{P,R}) from the considered image region. This procedure was described in Section 2.4.1.3 (see Equation 5.8). As a result, an image of local binary patterns is obtained, as shown in Figure 4.1 (a). Then, we compute the histogram of Local Binary Patterns (H-LBP).

In the second step (see Figure 4.1 (b)), we extract the spatial information from the image of local binary patterns. To do that, we extract the coordinates of all the LBP patterns that have contributed to a specific bin in the histogram H-LBP (representing a specific LBP type). Thus, a histogram of spatial coordinates per bin is computed. The range of spatial coordinates is quantized in order to shorten the length of these histograms, avoiding sparse histograms and keeping the computational cost manageable. To that end, we do a uniform sub-sampling of the image region coordinates, obtaining a total of M × N sub-sampled coordinates, defining M as the number of rows, and N as the number of columns. Therefore,


(a) An image of LBP is obtained and the H-LBP is computed from it. (b) Each one of the coordinates of an H-LBP bin contributes with different weights (bilinear interpolation) to a spatial histogram. (c) The spatial histograms related to the H-LBP bins and the H-LBP itself are concatenated to form the S-LBP descriptor.

Figure 4.1: Spatiograms of Local Binary Patterns.


the length of the resulting spatial histogram is M × N, where each bin corresponds to a sub-sampled coordinate. The contribution of a specific LBP pattern to its corresponding spatial histogram is computed using a bilinear interpolation approach. The bilinear interpolation increases the robustness against slight image translations and against the image grid effect, with respect to the more computationally simple nearest-neighbor approach [26]. As a result, we obtain 2^P spatial histograms whose length is M × N, where P is the number of neighbors in the LBP_{P,R}.

The third and last step can be seen in Figure 4.1 (c). It consists of concatenating the H-LBP itself along with the set of spatial histograms to form a super-descriptor called Spatiograms of Local Binary Patterns, whose dimension is 2^P + [2^P × (M × N)]. For example, if we set P = 8 and M = N = 4, the resulting feature vector will have 4352 components. In Figure 4.1, we can observe the steps of the S-LBP computation, where red points correspond to the sub-sampled coordinates, and the green point is the location of a specific LBP pattern.
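A simplified sketch of the described procedure is given below; it is illustrative only and, for brevity, assigns each pattern to its nearest spatial bin instead of applying the bilinear interpolation described above (scikit-image's local_binary_pattern is assumed for the first step):

    import numpy as np
    from skimage.feature import local_binary_pattern

    def s_lbp(region, P=8, R=1, M=4, N=4):
        """Simplified S-LBP sketch: the H-LBP plus one M x N spatial histogram
        per LBP type, concatenated into a single feature vector."""
        codes = local_binary_pattern(region, P, R, method='default').astype(int)
        H, W = codes.shape
        n_types = 2 ** P
        h_lbp = np.bincount(codes.ravel(), minlength=n_types)
        spatial = np.zeros((n_types, M, N))
        for y in range(H):
            for x in range(W):
                row = min(int(y * M / H), M - 1)   # quantized spatial coordinate
                col = min(int(x * N / W), N - 1)
                spatial[codes[y, x], row, col] += 1
        return np.concatenate([h_lbp, spatial.ravel()])   # 2^P + 2^P * M * N values

    region = np.random.randint(0, 256, (48, 32)).astype(np.uint8)
    print(s_lbp(region).shape)   # (4352,) for P = 8 and M = N = 4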

The S-LBP descriptor is highly discriminative since it contains both local spatial information (the H-LBP) and global spatial information (histograms of the spatial coordinates of all the LBP patterns). The sub-sampling used to quantize the spatial histograms allows reducing the computational cost, establishing a tradeoff between the computational cost and the discrimination ability. The descriptor is robust against slight translations thanks to the bilinear interpolation. In addition, the S-LBP descriptor is also robust against illumination changes, a property that is inherited from the use of LBP structures in its computation. Table 4.1 summarizes the qualitative information condensed in the S-LBP descriptor and its robustness attributes.

Attribute                                      | Foundation
Robust against dramatic illumination changes   | LBP is based on differences between pixels.
Robust against slight translations             | Bilinear interpolation approach.
Discrimination power                           | Local spatial information due to the LBP descriptor. Global spatial information due to the coordinate sub-sampling approach.

Table 4.1: Attributes of the S-LBP descriptor.

4.2.2 Local Binary Patterns based on Median and Pseudo-Covariance

Local Binary Patterns based on Median and Pseudo-Covariance (LBP-MPC) also adds spatial information, but in a more compact way than the previous approach. This shortens the length of the feature vector, and thus the


memory requirements and computational cost. To that end, the second step of the previous algorithm is modified, as shown in Figure 4.2 (b).

The modification consists of calculating two descriptive statistics of every set of spatial coordinates obtained for each bin in the H-LBP, in order to describe and summarize how they are distributed. These statistics are the median and the pseudo-covariance. We have chosen the median instead of the more common mean value because it is less influenced by skewed values and is also less affected by outliers. On the other hand, we can define the pseudo-covariance as a covariance matrix that uses the median, instead of the mean, in its computation:

pCov = [ Σ_{i=1..N} (x_i − M(x))(y_i − M(y)) ] / (N − 1),   (4.1)

where M(x) is the median of the set x.

Doing this for every set of coordinates, we obtain 2^P spatial vectors whose length is five. The first two values correspond to the median, and the remaining ones correspond to the three different values of the pseudo-covariance. Finally, the H-LBP itself and the set of spatial vectors are concatenated to form the feature descriptor called Local Binary Patterns based on Median and Pseudo-Covariance (LBP-MPC), as shown in Figure 4.2 (c), whose dimension is 2^P + 5 × 2^P = 6 × 2^P. For example, if we set P = 8, the resulting feature vector will have 1536 components. With respect to the previous approach, the descriptor length is reduced to 35.3% of the S-LBP length. However, it is less discriminative since the spatial information is more compact.
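A minimal sketch of this modified step is given below, again in Python/NumPy and only as an illustration of the description above; the helper name and the handling of empty bins are assumptions.

import numpy as np

def lbp_mpc(lbp_image, P=8):
    """Sketch of the LBP based on Median and Pseudo-Covariance (LBP-MPC).

    For each LBP code, the (row, col) coordinates of its occurrences are
    summarized by their median and a median-based 2x2 pseudo-covariance,
    giving 5 values per code (2 medians + 3 distinct covariance entries).
    """
    n_bins = 2 ** P
    h_lbp, _ = np.histogram(lbp_image, bins=n_bins, range=(0, n_bins))

    stats = np.zeros((n_bins, 5))
    for code in range(n_bins):
        r, c = np.nonzero(lbp_image == code)
        if r.size == 0:
            continue  # empty bin: leave the five statistics at zero (an assumption)
        med_r, med_c = np.median(r), np.median(c)
        dr, dc = r - med_r, c - med_c
        denom = max(r.size - 1, 1)
        # Pseudo-covariance: covariance-like entries built around the median.
        p_rr = np.sum(dr * dr) / denom
        p_cc = np.sum(dc * dc) / denom
        p_rc = np.sum(dr * dc) / denom
        stats[code] = [med_r, med_c, p_rr, p_rc, p_cc]

    # Descriptor of length 2**P + 5 * 2**P = 6 * 2**P (1536 for P = 8).
    return np.concatenate([h_lbp, stats.ravel()])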

The original LBP descriptor also tried to overcome the problem of losing spatial information by dividing the image region into several non-overlapping or overlapping blocks, and concatenating the resulting feature descriptors extracted from each of them [5], as we can observe in Figure 4.3. This approach can be interesting for us since it allows more precise extraction of spatial information, and therefore the resulting feature vector can be more discriminative. However, we have to keep in mind that the larger the number of blocks, the longer the feature vectors become, so this technique would not be suitable for some practical situations. This strategy can be more suitable for the LBP-MPC descriptor, since the length of the feature vector in that case is shorter.

4.3 Video descriptor

We propose an extension of the developed image descriptors to handle the temporal information, called Volumetric Descriptors based on Temporal Sub-sampling (VD-TS). The key idea of the proposed temporal extension is a down-sampling scheme in the time domain.

Generally, a video sequence is composed of a number of ordered images or image regions which describe a continuous movement in time. Images that are very close to each other in time hardly change their appearance, i.e., they are very similar. Therefore, it is not necessary to analyze all the images to identify the action that is being performed in the video sequence. In the case of an image sequence describing a hand gesture, the images contain hand poses. Hence, we can sub-sample this image sequence to select images with hand poses sufficiently separated to appreciate changes in their appearance, and avoid redundant information. Taking a suitable number of sub-sampled images, extracting discriminative features from them, and concatenating them all together represents a dynamic hand gesture appropriately.

Figure 4.2: Local Binary Patterns based on Median and Pseudo-Covariance. (a) An image of LBP is obtained and the H-LBP is computed from it. (b) The median and pseudo-covariance are calculated for every set of spatial coordinates that contribute to a specific bin in the H-LBP. (c) The spatial vectors containing the median and pseudo-covariance values are concatenated with the H-LBP itself to form the LBP-MPC descriptor.

Figure 4.3: LBP approach considering spatial information.

The algorithm to compute the proposed volumetric descriptors can be divided into three steps, as shown in Figure 4.4. The first step performs a random and quasi-equally spaced sub-sampling of the image sequence, in such a way that every time it is applied, a different set of sub-sampled images is obtained. This strategy consists of taking images which would correspond to an equally spaced sub-sampling plus an additional random shift, as shown in Figure 4.5. For this purpose, we first introduce two necessary parameters. On the one hand, we define the temporal interval ∆_e as the number of images between two consecutive equally spaced sub-sampled images, whose mathematical expression is the following:

∆_e = ⌊L / n⌋,   (4.2)

where L is the sequence length, n is the number of images that are sub-sampled, and ⌊x⌋ is the largest integer not greater than x.

Figure 4.4: Volumetric Descriptors based on Temporal Sub-sampling (VD-TS). (a) A video sequence is sub-sampled. (b) Spatial features are extracted from each sample by applying some image descriptor. (c) The resulting image descriptors are concatenated to form a spatio-temporal feature descriptor.

On the other hand, we define the maximum allowed shift as δ_max. It can be positive or negative, that is, we can carry out a shift to the right or to the left with respect to an equally spaced sample. Taking into account that any shift must be an integer number of images, and that we have to avoid taking very close consecutive samples, we consider δ_max as 25% of the temporal interval ∆_e, that is:

δ_max = ⌊∆_e / 4⌋.   (4.3)

The random and quasi-equally spaced sub-sampling process starts by calculating the first sub-sampled image s_0. To that end, an initial interval ∆_0 is defined, from which s_0 can be drawn. The interval ∆_0 goes from 1 (the first image in the video sequence) up to s_{0,max}, whose mathematical expression is:

s_{0,max} = L − δ_max − ∆_e × (n − 1).   (4.4)

In this way, we make sure that n samples are always taken for every sequence. The drawing is performed following a discrete uniform distribution over the considered initial interval ∆_0. As a result, we obtain the first temporal sample s_0.

Once we have obtained the first sample, we can calculate the following samples by the expression:

s_k = s_0 + k × ∆_e + ∆_shift,   k = 1, ..., n − 1,   (4.5)

where s_k is the k-th sample, and ∆_shift is a random shift that follows a discrete uniform distribution. The range of ∆_shift is [−δ_max, δ_max]. Thus, the k-th sub-sampled image is calculated as an initial shift determined by the sample s_0, plus k times the temporal interval ∆_e (the one corresponding to equally spaced sampling), plus a random shift defined by ∆_shift.

As in the case of the s_0 sample, we obtain a different sub-sampled sequence of images every time we apply the sub-sampling procedure, which simulates the execution of a dynamic gesture with slightly different timings.
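The following Python/NumPy sketch is a direct transcription of Equations (4.2)-(4.5), under the assumption of 0-based frame indexing; it is only illustrative and not the thesis implementation.

import numpy as np

def temporal_subsampling(L, n, rng=None):
    """Random, quasi-equally spaced sub-sampling of a sequence of length L.

    Follows Eqs. (4.2)-(4.5): equally spaced samples plus a small random
    shift, returning n frame indices in [0, L - 1] (0-based indexing).
    """
    rng = np.random.default_rng() if rng is None else rng
    delta_e = L // n                                  # Eq. (4.2): temporal interval
    delta_max = delta_e // 4                          # Eq. (4.3): maximum allowed shift
    s0_max = L - delta_max - delta_e * (n - 1)        # Eq. (4.4): upper end of the initial interval
    s0 = int(rng.integers(0, max(s0_max, 1)))         # first sample, uniform on the initial interval
    samples = [s0]
    for k in range(1, n):
        shift = int(rng.integers(-delta_max, delta_max + 1))   # Eq. (4.5): random shift
        samples.append(s0 + k * delta_e + shift)
    return np.clip(samples, 0, L - 1)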

Once all the sub-sampled images have been obtained, the second step (see Figure 4.4 (b)) consists of extracting spatial features from them by using one of the image descriptors explained in Section 4.2. Finally, the third step computes the resulting feature vector (the video descriptor) by concatenating all the image descriptors, forming a volumetric feature vector that contains both spatial (local and global) and temporal information.

The sub-sampling strategy is not only applied to reduce the number of images in a sequence; it also deals with variations in the execution speed of the hand gestures. Moreover, the sub-sampling can be very useful for generating several training samples from the same image sequence, since every time we apply it, we obtain a different set of samples. On the other hand, from the viewpoint of the recognition stage, this approach is very powerful as well, since we can apply the sub-sampling several times to the same video sequence to obtain different feature vectors simulating the same hand gesture with slight biases in the execution. In this way, we can test all of them, and select the one that reaches the highest score.
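As an illustration of this repeated sub-sampling and scoring strategy, a hypothetical sketch is given below; the callables image_descriptor and subsample (for instance, the temporal_subsampling sketch above), and the use of a classifier that exposes per-class decision scores (as a scikit-learn one-vs-all SVM does), are assumptions rather than the thesis implementation.

import numpy as np

def vd_ts(frames, n_samples, image_descriptor, subsample):
    """VD-TS sketch: sub-sample the sequence and concatenate image descriptors."""
    idx = subsample(len(frames), n_samples)
    return np.concatenate([image_descriptor(frames[int(i)]) for i in idx])

def recognize(frames, classifier, image_descriptor, subsample,
              n_samples=5, num_times=10):
    """Repeat the random sub-sampling several times and keep the class with
    the highest decision score over all repetitions."""
    best_score, best_label = -np.inf, None
    for _ in range(num_times):
        x = vd_ts(frames, n_samples, image_descriptor, subsample).reshape(1, -1)
        scores = classifier.decision_function(x)[0]   # one score per class (assumption)
        if np.max(scores) > best_score:
            best_score = float(np.max(scores))
            best_label = classifier.classes_[int(np.argmax(scores))]
    return best_label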


Figure 4.5: Image sequence sub-sampling and its parameters.

This is a versatile approach since it can use any image descriptor to compute the spatial features. However, we have to keep in mind the overall length of the final video descriptor, which depends on the length of the image descriptor and the number of sub-sampled images. Therefore, the number of image samples must be appropriate to extract enough hand poses to correctly represent the dynamic hand gestures, while avoiding the creation of a very long video descriptor.

On the other hand, if we consider the existing spatio-temporal variants of the LBP descriptor, that is, the VLBP and LBP-TOP descriptors, we find the same problem that the LBP descriptor had for describing hand poses: they do not consider global spatial information. Now, the hand appearance not only changes dramatically in the spatial domain, but also in the temporal domain. In addition, a hand gesture performed by one individual can differ significantly from the same hand gesture performed by a different individual. Therefore, the absence of localized patterns (global spatial information) results in a more complex problem than before. Furthermore, the feature extraction process in VLBP and LBP-TOP is very slow, and therefore adding spatial information to these descriptors to obtain more discriminative features would be prohibitive in terms of computational cost and of handling the resulting high dimensionality.


Chapter 5

Experimental results

In this chapter, we present the created visual database and the results for the different stages of the global system. In Section 5.1, the database is presented. We describe the scene, the different dynamic hand gestures that have been considered for the application, and the requirements that the database must fulfill. In Section 5.2, we make a brief review of the different metrics used to measure the recognition accuracy and compare the results. Finally, the results are presented in Sections 5.3, 5.4, and 5.5, for the detection phase, the recognition phase, and the global system, respectively.

5.1 Database

5.1.1 Environment and scene description

Many techniques have already been proposed for gesture recognition in a specific environment using the cooperation of several sensors with the aim of improving the accuracy. Despite these strong restrictions, gesture recognition is still brittle and often depends on the position of the individuals relative to the cameras. In our study, we have used the Creative Senz-3D device [73], although other color webcams are possible. This device integrates two sensors to capture color and depth information, respectively; however, we only consider the color camera.

The considered scene tries to be as realistic as possible, which means a standard environment with a non-uniform background and other moving objects. The typical scene structure is composed of an individual interacting with the computer, who is seated in front of a desk, where the computer is placed. The sensor is located on top of the screen, approximately 0.7 metres away from the individual who performs the hand gestures (see Figure 5.1).


Figure 5.1: Environment and scene captured by the sensor.

5.1.2 Gesture vocabulary

In order to replace the traditional interface provided by the mouse device to interact with computers, we propose a more natural and intuitive HCI interface that is based on the recognition of a series of hand gestures. It tries to simulate different mouse functions depending on the recognized gesture. The main functions of a mouse device are three: the movement of the cursor, left click, and right click. Additionally, we propose two additional functions from the viewpoint of the application: mouse activation and mouse deactivation. Thus, we consider five different hand gestures associated with each of the mouse functions, which are shown in Table 5.1. This table also contains a description of how to perform them, their identifier for the classification stage, and a visual example.

The specific hand gesture is related to the meaning of the mouse function. We form the ok symbol to indicate that everything is "ok" to start the mouse activation, that is, we are ready. And we close our hand to indicate that we want to "close" the mouse activation. In order to simulate the cursor, we choose a hand pose that allows us to point. When we point using our hand, we generally use the forefinger to indicate a direction (as an arrow). Similarly, we propose to join all the fingers into a point (another kind of arrow), because it is easier to discriminate with respect to the other hand gestures. In addition, it is a pattern that can be more accurately located due to the multiple edge intersections of all the fingers in a narrow area (see Table 5.1). Finally, the two dynamic gestures for performing the left click and the right click should fulfill two requirements: being highly discriminative with respect to the other hand gestures, and starting and finishing with the same hand pose as the cursor, allowing a natural flow of gestures. To that end, we propose to open and close the palm for the left click, and to move the forefinger and the middle finger up and down for the right click.


Notice that the hand gestures proposed for activating and deactivating the mouse are static, since they do not change over time. However, in order to simplify the system, we consider them as dynamic, taking into account that they have to persist for a certain period of time to be recognized.

Following the previous guidelines, we have created a visual database containing several video sequences, in which 6 individuals (half of them men and the other half women) perform the different dynamic hand gestures from Table 5.1, in addition to other non-representative gestures and transitions. Each of the individuals has been asked to record six different video sequences. Regarding the first 30 sequences (the first 5 sequences from the 6 individuals), every individual executes a different dynamic hand gesture in each of them. This group of sequences is used to extract different training samples for the detection and recognition phases. As a result, we have two databases, one for the detection phase and one for the recognition phase.

Manually segmented image regions containing hand poses and transitions, which are extracted from the five types of dynamic hand gestures, are used as training images for the detection phase. To that end, we generate the ground truth from the 30 video sequences, extracting the regions of interest. Finally, the extracted images are separated into two classes: background and hand poses. On the other hand, manually segmented video sequences of hand poses (dynamic hand gestures) are used for training the recognition phase. The segmented video sequences have different lengths, since each of them represents a different dynamic hand gesture, and they can even be executed by distinct individuals at different speeds. Finally, we separate them into five classes depending on the hand gesture type.

Regarding the other 6 sequences (the last sequence from each of the 6 individuals), which are not used for training, they are employed for testing purposes. In every test sequence, every individual executes all the dynamic hand gestures with the purpose of imitating a natural sequence that a user would execute when using our application. First, the user executes the "ok" symbol in order to activate the mouse; next, he moves the cursor until he needs to do a left click; after that, the user continues moving the cursor until he needs to do a right click; finally, he deactivates the mouse by executing the fist hand gesture. An overview of the database, containing some information about training samples and test sequences, is presented in Table 5.2.

The database is publicly available at the website www.gti.ssr.upm.es/data/.


Mouse function       | Hand gesture description                                                              | Class ID
Mouse activation     | OK symbol                                                                             | ok
Motion cursor        | Pointing with all fingers joined                                                      | cursor
Left click           | From the cursor pose, extend and close all of the fingers                            | left click
Right click          | From the cursor pose, extend and close the forefinger and the middle finger, at the same time | right click
Mouse deactivation   | Fist                                                                                  | fist

Table 5.1: Proposed hand gestures. Descriptions and visual examples.


Training (detection)
  Class ID      No. of images
  background    10248
  hand poses    10279

Training (recognition)
  Class ID      No. of sequences
  ok            122
  cursor        118
  left click    95
  right click   89
  fist          124

Test
  Sequence ID   Gender   Duration (frames)
  seq 1         F        528
  seq 2         M        979
  seq 3         M        642
  seq 4         M        764
  seq 5         F        682
  seq 6         F        709

Table 5.2: Database overview.

5.1.3 Requirements

In this section we present the requirements that the sample images and sequences in the database must fulfill.

The feature vectors extracted from the images and videos in the test stage have to be consistent with the feature vectors extracted in the training stage. To that end, images with the same size have to be compared. For this purpose, every time we apply the image descriptor, the image region is resized to specific dimensions. However, image regions can have different aspect ratios, and resizing them to other dimensions will deform them. To avoid this problem, we fix the aspect ratio for all the images, and we select a specific image size to normalize the image regions in the feature extraction step. In order to simplify the system and reduce the computational cost, we apply these operations to the training images, and then we choose the sliding window dimensions accordingly. As a consequence, all the training images must have the same aspect ratio as the sliding window. The process to estimate the optimal aspect ratio for the training samples and to fix the dimensions of the sliding window is described as follows.

We consider a set of rectangular bounding boxes containing hand poses from different hand gestures executed by different individuals. Then, we calculate their aspect ratios, representing them by the histogram shown in Figure 5.2 (a). Since the resulting distribution is approximately unimodal and symmetric, we choose the mean value as the optimal aspect ratio, whose value is 0.75.

Figure 5.2: Calculation of the sliding window dimensions. (a) Histogram of the aspect ratio from a set of bounding boxes containing different hand poses. (b) Width dimensions from the same set.

The next step is to set the dimensions of the sliding window. Taking the upper left corner of the bounding box as the reference point, there are two options. If we consider a larger width, the bounding box will contain more background, whereas if we consider a larger height, it will contain more wrist and arm area, as shown in Figure 5.3. The latter option is more desirable, since the wrist and the arm are very similar for all the individuals, and more uniform than the background. Next, we estimate the optimal initial width of the sliding window, i.e., the size of the sliding window corresponding to the base level of the multiresolution pyramid. For this purpose, we generate a histogram of the width dimensions (see Figure 5.2 (b)), observing that the distribution is approximately unimodal and asymmetric. For this reason, we take the median value, giving a window width of 130 pixels. Finally, when creating the training database, we have to consider an aspect ratio equal to 0.75.
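A small sketch of this estimation procedure, assuming the aspect ratio is defined as width divided by height and that the bounding boxes are given as (width, height) pairs in pixels, could look as follows; it is only an illustration of the statistics described above.

import numpy as np

def sliding_window_dimensions(bounding_boxes):
    """Estimate the sliding-window aspect ratio and base width.

    bounding_boxes : array-like of (width, height) pairs of hand regions.
    The aspect ratio is taken as the mean of the (roughly symmetric) ratio
    distribution and the base width as the median of the (skewed) widths.
    """
    boxes = np.asarray(bounding_boxes, dtype=float)
    ratios = boxes[:, 0] / boxes[:, 1]       # width / height for every box
    aspect_ratio = ratios.mean()             # ~0.75 for the thesis data
    base_width = np.median(boxes[:, 0])      # ~130 pixels for the thesis data
    base_height = base_width / aspect_ratio  # derived window height
    return aspect_ratio, base_width, base_height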

5.2 Metrics

5.2.1 Binary classification

Figure 5.3: Setting the dimensions of the sliding window. The right option is preferable.

In order to assess the accuracy of a classification problem, a confusion matrix is built, since it identifies both the nature of the classification errors and their quantities. A confusion matrix of size n × n, where n is the number of different classes, shows the predicted and actual classifications. Table 5.3 shows the confusion matrix for a binary classification, where the input data is classified into one of two non-overlapping classes. Each of the entries has the following meaning:

• TP is the number of true positives. They represent the correct detections.

• FN is the number of false negatives. They are the number of missed detections (events/objects not detected).

• FP is the number of false positives. They represent the false detections.

• TN is the number of true negatives. They are correctly rejected samples, that is, non-events that are not detected.

                          Actual
                    Positive   Negative
Predicted Positive    TP         FP
          Negative    FN         TN

Table 5.3: The confusion matrix for a two-class classification problem.

The most popular measures to quantify the results of a binary classification problem are the Precision and the Recall, which can be extracted from the confusion matrix. The Precision metric corresponds to the number of correctly detected events divided by the number of returned events, and can be calculated as:

Precision = True positives / (True positives + False positives).   (5.1)

On the other hand, the Recall metric corresponds to the correctly detected events divided by the number of events in the ground truth. Its mathematical expression is the following one:

Recall = True positives / (True positives + False negatives).   (5.2)

These two metrics can be combined in the F-score, a standard measure of the accuracy of a test in statistics. This metric is used when several algorithms are being evaluated, because it provides a single-number evaluation metric, helping to decide which algorithm works better. The mathematical expression of the F-score can be expressed as follows:

F-score = 2 · Precision · Recall / (Precision + Recall).   (5.3)

Another common way to represent the performance of a binary classifier is by means of a ROC (Receiver Operating Characteristic) curve. In a ROC curve, the true positive rate (Recall) is plotted as a function of the false positive rate for different values of a decision threshold. In Figure 5.4 (a), we can observe that both rates change depending on the position of the decision threshold. The False positive rate is calculated as:

False positive rate = False positives / (False positives + True negatives).   (5.4)

Figure 5.4: (a) Class distributions and decision threshold. (b) ROC curve.

These two rates are related to one another: moving the decision threshold to increase the true positive rate also increases the false positive rate, and vice-versa, so a high true positive rate cannot in general be obtained together with a low false positive rate, and a compromise is reached somewhere in the middle (see Figure 5.4 (b)). This fundamental trade-off varies depending on the classification problem and the database, which makes the shape of the ROC curve a visual quality metric.


5.2.2 Multi-class problem

In a multi-class classification problem, where the input data is classified into one, and only one, of M non-overlapping classes, it is common to use the confusion matrix itself as the metric, because just by taking a quick look at it, we can observe the results of the classification. Table 5.4 shows the M × M confusion matrix for an M-class classification problem. Along the main diagonal are the correct classifications, whereas all the other entries show misclassifications.

                           Actual
                  Class 1   Class 2   ...   Class M
Predicted Class 1   a_11      a_12    ...     a_1M
          Class 2   a_21      a_22    ...     a_2M
          ...       ...       ...     ...     ...
          Class M   a_M1      a_M2    ...     a_MM

Table 5.4: The confusion matrix for an M-class classification problem.

However, there are some interesting rates that can be calculated from Table 5.4 and that are commonly used to measure the results of a multi-class problem. On the one hand, the Average accuracy metric is defined as the number of correct predictions divided by the total number of samples, and can be obtained as follows:

Average accuracy = (Σ_{i=1..M} a_ii) / (Σ_{i=1..M} Σ_{j=1..M} a_ij).   (5.5)

On the other hand, the Precision and Recall metrics for every class can be calculated as:

Precision_k = a_kk / Σ_{j=1..M} a_kj,   (5.6)

Recall_k = a_kk / Σ_{i=1..M} a_ik,   (5.7)

where k = 1, ...,M specifies the class.
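Equations (5.5)-(5.7) can be computed directly from the confusion matrix of Table 5.4; the sketch below assumes the same layout (rows are predicted classes, columns are actual classes). A zero row or column produces the NaN values mentioned later for the confusion matrices of the recognition phase.

import numpy as np

def multiclass_metrics(cm):
    """Average accuracy and per-class Precision/Recall from an M x M confusion
    matrix laid out as in Table 5.4 (rows = predicted, columns = actual)."""
    cm = np.asarray(cm, dtype=float)
    average_accuracy = np.trace(cm) / cm.sum()    # Eq. (5.5)
    precision = np.diag(cm) / cm.sum(axis=1)      # Eq. (5.6): row-wise sums
    recall = np.diag(cm) / cm.sum(axis=0)         # Eq. (5.7): column-wise sums
    return average_accuracy, precision, recall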

If a one-vs-all strategy is carried out, M classifiers are obtained, and thereforewe can plot the ROC curve for each of them as well.

5.3 Evaluation of image descriptors for the detection phase

In this section, we evaluate the image descriptors that were presented in Section 4.2, that is, the S-LBP and LBP-MPC descriptors, for the detection phase, and we compare them with the LBP descriptor.


For this purpose, we divide the detection database (composed of background and hand poses) into two sub-sets: a training set and a test set. The training set contains 80% of the samples, and the test set contains the remaining 20%. In this case, we are only testing isolated hand poses. Next, we train a binary SVM classifier with the training set considering different parameters for the descriptors, which are then evaluated on the test set.

The three descriptors have two parameters in common, which are the number of neighbors P and the radius of the neighborhood R. Since previous works have proven that the best results for the LBP_{P,R} descriptor are achieved by considering R = 1 and P = 8 [2], and the multi-scale scheme in our descriptor designs is carried out by a multiresolution Gaussian pyramid, we do not try other values for these two parameters.

Regarding the LBP and LBP-MPC descriptors, we consider another parameter in addition to the previous ones: the number of blocks into which we divide the analyzed image region, num div (see Section 4.2). We are not interested in testing large numbers of divisions for two reasons. In the first place, the larger the num div parameter is, the longer the feature vector becomes, increasing the computational cost and memory requirements. In the second place, hand image regions have a reduced area in comparison with the whole image, and therefore a division into a few image blocks can represent them quite well. Hence, we only consider from 1 to 4 blocks.

As for the S-LBP descriptor, in addition to the P and R parameters, we need to fit the number of samples per row and column, M and N respectively. For the M and N parameters we test from 4 to 10 samples. No block division is considered, in order to keep the dimensionality of the resulting vector manageable.

We also test the regularization parameter, called C, of the SVM classifier. Large values of C can cause overfitting, while smaller values of C can produce an SVM model that is not able to separate the classes. We have tested a very wide range of values, from 0.01 to 100 in steps of 0.3, in order to find the value which achieves the best performance.
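This sweep can be sketched as follows; the use of scikit-learn's SVC and of a linear kernel are assumptions made only for illustration, since the thesis does not tie the parameter search to a particular implementation.

import numpy as np
from sklearn.svm import SVC

def grid_search_c(x_train, y_train, x_test, y_test):
    """Train an SVM for every C in [0.01, 100] with steps of 0.3 and keep the
    value with the lowest test misclassification rate (assumed linear kernel)."""
    best_c, best_error = None, np.inf
    for c in np.arange(0.01, 100, 0.3):
        clf = SVC(kernel='linear', C=c).fit(x_train, y_train)
        error = 1.0 - clf.score(x_test, y_test)   # fraction of misclassified samples
        if error < best_error:
            best_c, best_error = float(c), error
    return best_c, best_error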

Next, we present the obtained results for every image descriptor, considering the aforementioned ranges of values for the previous parameters.

Local Binary Patterns

Figure 5.5 shows the fraction of misclassified samples (y axis) for every configuration of parameters (x axis) for the LBP descriptor. Every configuration of parameters comprises both the descriptor and the classifier parameters. We can observe three clusters. The first one corresponds to not dividing the image into blocks (num div = 1), the second one to dividing the image into 2 × 2 blocks (num div = 2), and the third one to 3 × 3 and 4 × 4 blocks (num div = 3, 4). The different points in every cluster correspond to different values of the C parameter in increasing order.

Figure 5.5: Fraction of misclassified samples for different parameter configurations of the LBP descriptor.

Obviously, the best performance is achieved for a specific value of the C parameter in every cluster. Table 5.5 summarizes the sets of configuration parameters for which the best scores are reached. The third cluster is the most interesting one since it contains the best results. Therefore, we extract more points from it in order to observe how the metrics behave with the C parameter. As we expected, the best classification is achieved for num div = 4, since the features are more spatially discriminative. As for the C parameter, the optimal value lies between 20.71 and 24.91 (these values all yield almost identical scores).


Descriptor parameters      SVM parameter   Metrics
P   R   num div            C               Precision   Recall   F-score
8   1   1                  39.61           0.975       0.982    0.9785
8   1   2                  66.31           0.990       0.995    0.9925
8   1   3                  74.71           0.997       0.994    0.9955
8   1   3                  5.41            0.994       0.997    0.9955
8   1   3                  40.81           0.998       0.994    0.9960
8   1   4                  59.71           0.994       0.997    0.9955
8   1   4                  97.21           0.994       0.998    0.9960
8   1   4                  24.01           0.995       0.998    0.9965

Table 5.5: Precision, Recall, and F-score metrics for different parameter configurations of the LBP descriptor (all the parameters have been already introduced in previous sections).

Local Binary Patterns based on Median and Pseudo-Covariance

The fraction of misclassified samples for every configuration of parameters for the LBP-MPC descriptor is shown in Figure 5.6. It is very similar to the previous one, since we can observe the same three clusters as well. The first one corresponds to not dividing the image into blocks (num div = 1), the second one to dividing the image into 2 × 2 blocks (num div = 2), and the third one to 3 × 3 and 4 × 4 blocks (num div = 3, 4).

Table 5.6 shows the sets of configuration parameters that obtain the highest performance in every cluster. In this case, the best classification score is achieved for num div = 4 and for a C parameter between 20.71 and 24.91.

Descriptor parameters      SVM parameter   Metrics
P   R   num div            C               Precision   Recall   F-score
8   1   1                  3.31            0.974       0.984    0.9790
8   1   2                  66.31           0.990       0.995    0.9925
8   1   3                  85.21           0.996       0.994    0.9950
8   1   3                  23.71           0.997       0.994    0.9955
8   1   3                  40.81           0.998       0.994    0.9960
8   1   4                  54.61           0.994       0.997    0.9955
8   1   4                  3.31            0.994       0.998    0.9960
8   1   4                  23.11           0.995       0.998    0.9965

Table 5.6: Precision, Recall, and F-score metrics for different parameter configurations of the LBP-MPC descriptor (all the parameters have been already introduced in previous sections).

Figure 5.6: Fraction of misclassified samples for different parameter configurations of the LBP-MPC descriptor.

Comparing both tables (5.5 and 5.6), corresponding to the LBP and LBP-MPC descriptors, we observe that the results are very similar. Moreover, the configuration for which the best performance is achieved is the same.

The results can be interpreted as follows: the addition of global spatial information using the median and pseudo-covariance is redundant for this specific classification task. The reason could be that the standard LBP already achieves near-perfect performance. For other classification tasks where the LBP does not behave so well, the difference with respect to the LBP-MPC could be more appreciable.

Spatiograms of Local Binary Patterns

Excellent results have been obtained for most of the configurations of the S-LBP descriptor. The misclassification rate is lower than 5.5 · 10^{-3} in all cases, which means that the minimum F-score that can be achieved is 0.9960. Table 5.7 shows different sets of configuration parameters for which the best scores are reached. In order to check how the detection accuracy changes if we consider lower values for M and N, we also test M = N = 2, which is the minimum possible value. From Table 5.7, we can observe that if we set M = N = 2 the F-score slightly decreases with respect to the other configurations.

Descriptor parameters      SVM parameter   Metrics
P   R   M   N              C               Precision   Recall   F-score
8   1   2   2              67.21           0.987       0.995    0.9910
8   1   4   4              21.01           0.998       0.998    0.9980
8   1   4   10             23.71           1           1        1
8   1   5   9              0.01            0.998       1        0.9995
8   1   8   8              5.71            0.999       1        0.9990

Table 5.7: Precision, Recall, and F-score metrics for different parameter configurations of the S-LBP descriptor (all the parameters have been already introduced in previous sections).

The obtained detection accuracy for every set of configuration parameters is very similar. This allows us to use a smaller number of samples M and N, achieving a high performance while reducing the computational cost.

In comparison with the previous image descriptors, i.e., the LBP and LBP-MPC descriptors, the S-LBP descriptor achieves the highest score. The reason is that the addition of global spatial information using spatial histograms allows higher discrimination. Unlike the LBP-MPC descriptor, the S-LBP encodes the spatial information better, maintaining the "map" of where the LBP patterns come from.

In conclusion, the most appropriate image descriptor to use in the detection phase is the S-LBP descriptor. In particular, the best results are obtained using a configuration with M = 4 and N = 10, and considering a value of 23.71 for the C parameter.

5.4 Evaluation of the recognition phase

In this section, we evaluate the video descriptor that was presented in Section 4.3 for the recognition phase. We also compare it with the existing spatio-temporal extensions of the LBP descriptor, that is, the VLBP and LBP-TOP descriptors. In addition, we estimate the best values for the temporal window parameters. Finally, we evaluate the complete recognition phase, which integrates the video descriptors together with the temporal sliding window strategy.

5.4.1 Evaluation of video descriptors

First, we evaluate the three aforementioned video descriptors. For this purpose, we divide the recognition database (composed of five classes) into two sub-sets: a training set and a test set. The training set contains 80% of the samples, and the test set contains the remaining 20%. In this case, we are only testing segmented video sequences that contain only one type of dynamic hand gesture. Next, we train a set of five SVM classifiers considering different parameters for the descriptors, which will be used to evaluate the test set.

The VLBP descriptor depends on three parameters, which are the temporal interval L, the number of neighbors P, and the radius of the neighborhood R (VLBP_{L,P,R}). The original implementation of the VLBP descriptor is restricted to a value of P = 4 because of the computational cost. In addition, other works [5] have proven that the best results are achieved for the VLBP_{4,1,1}; therefore we set R = 1. Regarding the temporal interval, we test a range of values that goes from 1 to 4.

The LBP-TOP_{PXY,PXT,PYT,RX,RY,RT} descriptor presents similar restrictions regarding the number of neighbors in the three planes XY, XT and YT, although we can increase this value up to PXY = PXT = PYT = 8 in this case, since it is less computationally demanding. In addition, other works [5] have proven that the best results are achieved for the LBP-TOP_{8,8,8,1,1,1}. Therefore, we set 8 neighbors for every plane, and RX = RY = 1, since they are the best values for extracting the local spatial structures. Regarding the radius in the temporal axis, we test a range of values from 1 to 4. The reason is that the temporal variation of a video sequence strongly depends on the actions performed in it, and therefore there is no single value that could be used for all cases.

As for the VD-TS descriptor, the parameters are: those corresponding to the spatial feature extraction, which are defined by the employed image descriptor (intraframe encoding), and the parameters of the temporal sub-sampling procedure (interframe encoding). We use both the LBP-MPC and S-LBP descriptors, considering their best configurations (obtained in the previous section), as the image descriptors to be integrated into the VD-TS approach. We also evaluate different sets of parameters for the sub-sampling of the video sequence. These parameters are the number of sub-sampled images (num samples), and the number of times that the sub-sampling stage is applied to obtain different feature vectors per temporal window (num times).

Regarding the num samples parameter, the larger the number of samples, the better the representation of the dynamic hand gesture is. However, we have to keep in mind that the length of the resulting video descriptor (lengthvd) is proportional to the length of the employed image descriptor (lengthid):

lengthvd = num samples · lengthid, (5.8)

and therefore the computational cost and memory requirements could be a strong restriction.

We use a maximum of num samples = 5 for the S-LBP configuration due to its demanding memory requirements. Taking into account the nature of the two dynamic hand gestures, that is, the left click and the right click, 5 samples can represent the spatial and temporal variation of these gestures appropriately. The first two samples can capture the progressive dynamics at the beginning, the following sample the intermediate state of the gesture, and the final two samples the regressive dynamics at the end. Given this value of temporal sampling, the maximum value used for both parameters M and N is 8, to avoid exceeding the computer memory. As for the LBP-MPC descriptor, we can use a higher value for the num samples parameter, since the resulting descriptor length is shorter, and therefore it has fewer memory restrictions.

Regarding the num times parameter, we have used a value of 10, which is considered appropriate to carry out the voting strategy and improve the recognition accuracy.

Similar to the image descriptor evaluations, we also test several values for the classifier parameter C. In particular, we test the same range, that is, from 0.01 to 100, in steps of 0.3.

Next, we present the obtained results for every video descriptor, considering the aforementioned ranges of values for the previous parameters.

Volume Local Binary Patterns

Figure 5.7 shows the fraction of misclassified samples for every configuration of parameters for the VLBP descriptor. Every configuration contains the descriptor and classifier parameters. We can observe four regions separated by red lines. The first one corresponds to a temporal step L = 1, the second one to L = 2, the third one to L = 3, and the last one to L = 4. The different points in every cluster correspond to different values of the C parameter.

We can observe that the misclassification rates are very high in general. However, there are several sets of configuration parameters for which the recognition rates are significantly better. In particular, these sets correspond to the first region, that is, to L = 1.

Figure 5.7: Fraction of misclassified samples for different parameter configurations of the VLBP descriptor.

Table 5.8 shows the parameter configurations for which the best results are obtained, together with other comparative configurations.

A further analysis that breaks down the recognition rate for every class is carried out to detect anomalous behaviours. For this purpose, we use the confusion matrix and the ROC curve, which are shown in Figure 5.8. They are computed using the parameters that achieved the highest recognition accuracy, that is, L = 1 and C = 0.31.

The values in the last column of the confusion matrix are the Precision values for every class, the values in the last row are the Recall values for every class, and the bottom right cell shows the overall accuracy. Notice that some values could be NaN, meaning that there is a 0 in the denominator of the Precision or Recall equations.


Descriptor parameters   SVM parameter   Metrics
P   R   L               C               Accuracy
4   1   1               0.01            0.609
4   1   1               0.31            0.773
4   1   1               0.61            0.555
4   1   2               30.61           0.336
4   1   3               0.01            0.555
4   1   4               0.01            0.382

Table 5.8: Average accuracy for different parameter configurations of the VLBP descriptor (all the parameters have been already introduced in previous sections).

The different hand gestures or classes have been labeled in the confusion matrix as follows: 1 for the "cursor" class, 2 for the "fist" class, 3 for the "left click" class, 4 for the "ok" class, and 5 for the "right click" class. The ROC curve is a more visual representation that allows us to identify more precisely the involved classes and estimate the magnitude of the recognition.

As we can observe, the "fist" class is completely classified as the "cursor" class, while the rest of the classes are correctly classified. That means that the Precision and Recall of the "fist" class have a 0% value, and the Precision and Recall of the "cursor" class have values of 100% and 49%, respectively, as we can see in the confusion matrix. This thresholding behaviour is shown in the ROC curves: depending on the decision threshold, all the "cursor" and "fist" samples are always classified as the "cursor" class, or they are never classified as the "cursor" class. Although the rest of the classes are correctly classified, we cannot allow an entire class to be misclassified. In conclusion, the features extracted by the VLBP descriptor are not representative enough to discriminate between the "fist" and "cursor" classes.

Local Binary Patterns from Three Orthogonal Planes

Figure 5.9 shows the fraction of misclassified samples for every configuration of parameters for the LBP-TOP descriptor. As in the previous case, we can observe four regions, corresponding to RT = 1, 2, 3, 4, respectively. We can observe that the misclassification rates are very high, and worse than for the VLBP descriptor (as expected). The best recognition rates are achieved for RT = 1 in most cases, and for RT = 2 in two exceptional cases. Table 5.9 shows the parameter configurations that achieve the best results for every temporal step RT.

In the confusion matrix of Figure 5.10, we can observe that the reason for such a low recognition accuracy is that there are two classes that are being totally misclassified. In particular, the "cursor" class is classified as the "right click" class, and the "fist" class is classified as the "left click" class.


Figure 5.8: Confusion matrix and ROC curve for the best parameter configuration of the VLBP descriptor.


Figure 5.9: Fraction of misclassified samples for different parameter configurations of the LBP-TOP descriptor.


Descriptor parameters                      SVM parameter   Metrics
PXY   PXT   PYT   RX   RY   RT             C               Accuracy
8     8     8     1    1    1              75.01           0.555
8     8     8     1    1    2              41.11           0.555
8     8     8     1    1    3              80.41           0.382
8     8     8     1    1    4              0.01            0.382

Table 5.9: Average accuracy for different parameter configurations of the LBP-TOP descriptor (all the parameters have been already introduced in previous sections).

The rest of the classes are correctly classified; however, the Precision of the "left click" and "right click" classes is affected by the misclassification of the "cursor" and "fist" classes, as can be seen in the confusion matrix.

We can observe this behaviour in the ROC curves. Depending on the decision threshold, both the "cursor" and "right click" classes are always classified as the "right click" class, or they are never classified as "right click", since they represent the same class for the classifier. The same situation happens for the "left click" and "fist" classes.

The LBP-TOP descriptor was designed to solve the dimensionality problem of the VLBP descriptor. However, this design is based on decreasing the length of the feature vector by reducing the information. For this reason, the results for the LBP-TOP descriptor are worse than for the VLBP descriptor. The dimensionality problem is solved, but the extracted features are much less discriminative, causing poor results.

Video Descriptor based on Temporal Sub-sampling

Table 5.10 shows the results obtained for the VD-TS approach in combination with the best parameter configurations of the S-LBP and LBP-MPC image descriptors. The notation employed for the LBP-MPC descriptor is LBP-MPC_{P,R,num div}, and for the S-LBP descriptor it is S-LBP_{P,R,M,N}. We can observe that all the considered descriptors reach very good recognition rates.

For the S-LBP we test 4 and 5 samples to carry out the sub-sampling process. We want to check how the recognition accuracy changes if we consider a lower number of samples. For the LBP-MPC we can use a larger number of samples, since its length is much more manageable; in particular, we test 4, 8, and 10. As we increase the number of samples for the LBP-MPC, the obtained performance becomes more similar to that of the S-LBP descriptor with a lower num samples value. However, the performance continues to be superior for the S-LBP.


Figure 5.10: Confusion matrix and ROC curve for the best parameter configuration of the LBP-TOP descriptor.


We can observe that the best recognition accuracy is achieved when using the S-LBP_{8,1,4,10} descriptor and 5 samples to carry out the sub-sampling process.

Descriptor parameters                      SVM parameter   Metrics
Image descriptor       num samples         C               Accuracy
LBP-MPC_{8,1,1,4}      4                   3.91            0.948
LBP-MPC_{8,1,1,4}      8                   3.31            0.967
LBP-MPC_{8,1,1,4}      10                  6.61            0.986
S-LBP_{8,1,4,10}       4                   2.11            0.984
S-LBP_{8,1,4,10}       5                   0.61            1
S-LBP_{8,1,8,8}        4                   77.71           0.985
S-LBP_{8,1,8,8}        5                   89.41           0.995

Table 5.10: Average accuracy for different parameter configurations of the VD-TS descriptor (all the parameters have been already introduced in previous sections).

Figure 5.11 shows the confusion matrix and the ROC curve for the VD-TS descriptor with 5 samples, combined with the S-LBP_{8,1,4,10} descriptor. Since the maximum recognition accuracy has been achieved, all the gestures have been correctly classified, and the Precision and Recall values are 100% for all the classes.

In comparison with the VLBP and LBP-TOP descriptors, the VD-TS approach, together with the proposed image descriptors, is much more discriminative. In particular, it reaches a recognition accuracy approximately 20% higher than the VLBP descriptor, and 40% higher than the LBP-TOP descriptor. The reason why the VD-TS descriptor is superior to the VLBP and LBP-TOP is the specific combination of global spatial information and temporal information. Furthermore, the feature extraction process is significantly slower for the VLBP and LBP-TOP than for the VD-TS.

Therefore, the most appropriate video descriptor to use in the recognition phase is based on the VD-TS approach with 5 samples, the S-LBP descriptor with M = 4 and N = 10, and a C parameter equal to 0.61.

5.4.2 Evaluation of the temporal window

The strategy followed to temporally segment the dynamic hand gestures consists of scanning the segmented video sequence with a sliding temporal window (see Section 3.4). The size of the temporal window is fixed and must be estimated. In order to estimate its best value, we have to keep in mind that every individual can execute the dynamic hand gestures at a different speed, and that every type of gesture has a specific length. Therefore, we have made a study to obtain the optimal size of the temporal window.
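The scan itself can be sketched as follows; the step between consecutive windows is an assumed parameter, since the thesis only fixes the window size.

def sliding_temporal_windows(sequence_length, window_size=24, step=1):
    """Yield the (start, end) frame indices of every fixed-size temporal window
    used to scan a video sequence (window_size = 24 is the value estimated below)."""
    for start in range(0, sequence_length - window_size + 1, step):
        yield start, start + window_size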


Figure 5.11: Confusion matrix and ROC curve for the best parameter configuration of the VD-TS descriptor.



On the one hand, there are only two strictly dynamic hand gestures: the "left click" and the "right click", which have a beginning and an end. Hence, we can estimate the size of the sliding window as the average length of these two gestures. The other gestures have a static nature, and therefore the "ok" symbol and the "fist" can be segmented at any temporal instant, since their appearance does not change over time (the extracted features will be almost the same in any temporal window). As for the cursor gesture, we can consider it as static, although with an associated trajectory.

We have analyzed six video sequences considering different sizes for the sliding window. In every sequence, an individual executes several dynamic hand gestures, spanning all the types. Table 5.11 shows, for every sequence, the size of the temporal window that achieves the best recognition accuracy.

In order to avoid outliers, that is, those cases in which individuals execute a hand gesture at a very different speed from the others, we choose the median value for the size of the temporal window. In this case, this value is 24.

        seq 1   seq 2   seq 3   seq 4   seq 5   seq 6
size    24      24      24      39      29      23

Table 5.11: Optimal size of the sliding temporal window for different video sequences.

5.4.3 Integration of the video descriptor and the temporal window

Once we have evaluated the video descriptors, the classifiers, and the size of the temporal window, we integrate both the feature extraction and the temporal segmentation stages to carry out the recognition task.

Next, we calculate the recognition accuracy for every test sequence. For this purpose, we consider the video descriptors (and associated parameters) that achieved the best results in the previous study (see Section 5.4.1). They are VLBP_{4,1,1} with a C parameter equal to 0.31, LBP-TOP_{8,8,8,1,1,1} with C = 75.01, and VD-TS combined with the image descriptor S-LBP_{8,1,4,10}, considering C = 0.61 and num samples = 5. On the other hand, we set the size of the temporal window to 24. Table 5.12 shows the recognition accuracy for every test sequence and every video descriptor.

As we can see, the introduction of the sliding temporal window has degraded the recognition accuracy for every descriptor. On the one hand, since every test sequence contains the continuous execution of several dynamic hand gestures, there are temporal windows that contain the end of a gesture and the beginning of the following one (transitions).


        VLBP_{4,1,1}   LBP-TOP_{8,8,8,1,1,1}   VD-TS
seq 1   0.720          0.507                   0.949
seq 2   0.765          0.521                   0.961
seq 3   0.711          0.535                   0.935
seq 4   0.695          0.487                   0.923
seq 5   0.689          0.485                   0.897
seq 6   0.770          0.510                   0.959

Table 5.12: Recognition results for the different considered video descriptors.

Since we have not trained a rejection class for the recognition phase, these transition windows are necessarily labeled incorrectly as one of the five considered classes. However, we have implemented the validation prediction strategy to deal with this problem. On the other hand, the fact that the size of the temporal window is fixed degrades the recognition accuracy for the individuals whose execution speed is not suited to the size of the temporal window.

We can conclude from Table 5.12 that the individuals in test sequences 4 and 5 execute the dynamic hand gestures significantly faster or slower than the individuals in the rest of the sequences. Therefore, the sliding temporal window does not fit them so appropriately.

Figure 5.12 shows the confusion matrix and the ROC curves for the video sequence that has obtained the highest performance, that is, sequence 2. Most of the misclassified hand gestures are classified as the "cursor" class. In addition, as we expected, the worst classified hand gestures are the left click and the right click, especially the left click. However, the confusion among the gestures can be different for every individual.

5.5 Overall system results

In this section, we present the results for the global system, integrating the detection, tracking and recognition phases. To that end, we consider the image and video descriptors, and the necessary parameters for every phase, which achieved the best results in the previous studies.

The global system has been tested by considering simulated noisy detections. The reason is the following. The detection phase must be able to detect a set of specific hand poses within a scene whose background is uncontrolled. In this case, we have to train our system considering two classes: background and hand poses. The latter is a controlled class that can be easily trained, because the range of possible samples is limited. However, the background class is uncontrolled because it has a huge variability in its samples (any pattern could appear as background).


Figure 5.12: Confusion matrix and ROC curves for the video sequence that has obtained the best recognition rate.


In order to confront this problem, it would be necessary to train the system with a very large variety of background samples. However, this solution is out of reach for a single person, since creating a large enough database would require a lot of time.

This fact contrasts with the positive evaluation of the image descriptors (see Section 5.3). The reason is that the training database was composed of sequences that had a single background, and therefore it was easily characterized by a reduced set of samples.

However, when evaluating the image descriptors considering image regions from different backgrounds, the system is not properly trained to achieve a satisfactory classification.

The simulation of the noisy detections consists of extracting the detections from the ground truth and modifying them by applying small random shifts to their centroids.
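A possible sketch of this simulation is shown below; the box representation (centroid plus width and height) and the maximum shift of 5 pixels are assumptions, since the thesis does not specify them.

import numpy as np

def simulate_noisy_detections(gt_boxes, max_shift=5, rng=None):
    """Perturb ground-truth bounding boxes by small random shifts of their centroids.

    gt_boxes : array-like of (cx, cy, w, h) rows (assumed representation).
    max_shift : maximum centroid displacement in pixels (assumed value).
    """
    rng = np.random.default_rng() if rng is None else rng
    boxes = np.asarray(gt_boxes, dtype=float).copy()
    shifts = rng.integers(-max_shift, max_shift + 1, size=(len(boxes), 2))
    boxes[:, :2] += shifts        # perturb centroids only; sizes are left unchanged
    return boxes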

The specific configuration for the whole system uses the VD-TS descriptor combined with the S-LBP descriptor, with the parameters shown in Table 5.13.

P   R   M   N    num samples   C
8   1   4   10   5             0.61

Table 5.13: Parameters for the recognition phase.

The evaluation of the whole hand-gesture recognition system is carried out by testing the six sequences. In addition, we have broken down the recognition rate for every class, allowing us to calculate the mean value of the recognition accuracy for every class. Thus, we can draw several conclusions about every type of gesture. Finally, we calculate the global recognition accuracy as the mean value of the recognition rates of every class. We present the results in Table 5.14.

The average recognition accuracy that we have obtained for the global system is 0.8026. It represents the mean value of the recognition rates for every type of hand gesture.

As expected, the hand gestures with a static nature have obtained the best performances, that is, the "cursor", "fist", and "ok" classes. The "left click" and "right click" classes present a lower recognition rate. Since these hand gestures have complex dynamics, the variance of these classes is larger than for static gestures. It would be possible to improve their recognition rates, and therefore the global recognition accuracy, by training the system with more samples.

Sequence 5 for the "fist" class and sequence 4 for the "left click" class have been removed, since the individuals in these sequences execute the corresponding hand gestures incorrectly or with an abnormal speed. For this reason, we do not take their recognition rates into account.


Class         Sequence   Sequence accuracy   Class accuracy   Global accuracy

cursor        seq 1      1.000               1.000            0.8026
              seq 2      1.000
              seq 3      1.000
              seq 4      1.000
              seq 5      1.000
              seq 6      1.000

fist          seq 1      0.783               0.8082
              seq 2      0.878
              seq 3      0.731
              seq 4      0.824
              seq 6      0.825

left click    seq 1      0.679               0.5970
              seq 2      0.485
              seq 3      0.556
              seq 5      0.709
              seq 6      0.556

ok            seq 1      0.906               0.895
              seq 2      0.867
              seq 3      0.806
              seq 4      0.906
              seq 5      0.955
              seq 6      0.931

right click   seq 1      0.741               0.713
              seq 2      0.711
              seq 3      0.686
              seq 4      0.565
              seq 5      0.768
              seq 6      0.808

Table 5.14: Global recognition accuracy.



In conclusion, the recognition accuracy obtained by the global system is quite high, allowing the developed hand-gesture recognition system to be used in real HCI applications.


Appendix A

Contributions

Publications

A paper entitled "Human-Computer Interaction Based on Visual Recognition using Volumegrams of Local Binary Patterns" has been sent to the International Conference on Consumer Electronics (ICCE).

It presents a robust hand-gesture recognition system, which allows a more natural input interface for simulating a mouse. The key element of the system is a novel and highly discriminative video descriptor called Volumetric Spatiograms of Local Binary Patterns (VS-LBP).

Currently, the paper is under review.

Databases

In order to carry out this thesis, a visual database has been created. It contains several video sequences, in which different individuals execute a set of dynamic hand gestures. The considered gestures are designed to control the different functionalities of a mouse-like pointing application.

The database is publicly available at the website www.gti.ssr.upm.es/data/.

Software

In order to implement the prototype of the hand-gesture recognition system, a recognition library has been created. All the code has been implemented in Matlab, but the library also contains external functions implemented in C.

To access the library, contact [email protected] or [email protected].

