Recognizing Hand Gestures using the Weighted Elastic Graph
Matching (WEGM) Method
Yu-Ting Li, Juan P. Wachs
Affiliation of the authors: Industrial Engineering, Purdue University, West Lafayette, IN 47906, U.S.A.
Correspondence and reprints: Juan P. Wachs, School of Industrial Engineering, Purdue University, West
Lafayette, IN 47907; Tel: (765) 496-7380; e-mail: [email protected]
Abstract
This paper proposes a weighted scheme for elastic graph matching hand posture recognition. Visual features scattered on the
elastic graph are assigned corresponding weights according to their relative ability to discriminate between gestures. The
weights’ values are determined using adaptive boosting. A dictionary representing the variability of each gesture class is
expressed in the form of a bunch graph. The positions of the nodes in the bunch graph are determined using three techniques:
manually, semi-automatically, and automatically. Experimental results show that the semi-automatic annotation method is
efficient and accurate in terms of three performance measures: assignment cost, accuracy, and transformation error. In terms of
recognition accuracy, our results show that weighting the features gives more significant discriminative power
than the classic method (uniform weighting). The weighted elastic graph matching (WEGM) approach was used to classify a
lexicon of ten hand postures, and it was found that the poses were recognized with a recognition accuracy of 97.08% on average.
Using the weighting scheme, computing cycles can be decreased by computing the features only for those nodes whose
weight is relatively high and ignoring the remaining nodes. It was found that only 30% of the nodes need to be computed to
obtain a recognition accuracy of over 90%.
Keywords: Elastic bunch graph, Graph matching, Feature weight, Hand gesture recognition
1. Introduction
One of the main goals in the human-computer interaction (HCI) field is the study of innovative ways to enhance
the user experience through natural communication and developing the technology that enables such interaction. In
this context, new trends include the development of a new generation of smaller, cheaper and versatile sensors [1,2].
Users’ subjective satisfaction favors those systems that provide an enhanced interaction experience based on the
naturalness and expressiveness that they offer [3]. Among the modalities relying on natural interaction, gestures
stand out as one of the main channels to interact with computers on many fronts, such as sign language
interpretation [2], assistive technologies [4], and game control applications [5], to mention a few. Gestures are also
being adopted in areas where touch can be a vehicle of infection transmission (e.g., browsing medical images in the
operating room) [6], or in outpatient clinics [7]. Common approaches for vision-based hand posture recognition
involve [8] (1) 3D model-based methods [9], (2) appearance-based models [10], and (3) shape analysis [11]. See [12]
for a detailed review on gesture-based interfaces.
1.1. Elastic Graphs
Elastic graph matching (EGM) is a technique used for object recognition [13], where the object is represented by
a labeled bunch graph. The bunch graph consists of a connected graph where the most salient features on the image
are represented as series of nodes. A bunch graph is built on a set of template images (also called ‘dictionary’). To
compare the similarity between one template image within the bunch and a target image, the graph obtained from
the template image is matched against the target image. Filter responses are computed at each node in the graph, and
a cost function is minimized based on a metric applied to the node assignment. Over the years, EGM was
implemented for tasks such as face recognition [13,14], face verification [15] and gesture recognition [16]. In
Wiskott et al. [13], EGM was used to recognize faces in images, where features were extracted from
typical face parts (e.g., the pupils, the beard, the nose, and the corners of the mouth). Triesch et al. employed EGM to
develop a classification approach for hand gestures against complex backgrounds [16]. EGM was also shown to
have better performance than eigenfaces [17], and auto-association and classification neural networks [18]. EGM
outperformed the aforementioned two methods due to its robustness to lighting variation, face position and
expression changes. Another variant of EGM is morphological elastic graph matching (MEGM) [19], which was
applied for frontal face authentication based on multi-scale dilation-erosion operations. One of the main drawbacks
of this method is the computational complexity involved in the detection and classification processes.
1.2. Motivation
One significant contribution of this paper is a procedure to establish the weights of the nodes in the graph, thus
validating the importance of weighting the features. We propose the weighted elastic graph matching (WEGM)
method for hand posture recognition. In our method, those features with a higher likelihood of appearing in the target
image receive a higher weight than those features which are less consistent with the graph model. Using weights
allows us to allocate more computational resources to those features that are more discriminative while ignoring
those features with lower importance [20]. Three metrics are used in the experiment to show that features with more
discriminative power dominate the recognition performance of the system. A secondary contribution is a
comparative study on efficient annotation techniques to create the bunch graphs.
The rest of the paper is organized as follows: in Section 2 the Elastic Bunch Graph Matching (EBGM) and
Adaptive Boosting algorithm are described. In Section 3 the proposed annotation methods and the weighted hand
gesture recognition algorithm (WEGM) are presented. Experimental results in Section 4 demonstrate the feasibility
and efficiency of the proposed techniques. Finally, the discussion and conclusions are presented in Section 5.
2. Fundamentals of Proposed Algorithms
2.1. Elastic Bunch Graph
The section below briefly describes the principles of the Elastic Bunch Graph; for more details see [16]. Bunch
graphs were used to represent and recognize hand postures [16,21] in grayscale images. Each bunch graph is a
collection of individual graphs representing a posture. Salient points on the underlying image are labeled as nodes of
a graph over the object. The links connecting the nodes express some topological metric, such as the Euclidean
distance. A Gabor jet is defined as the set of responses at specific locations in the images, obtained when convolving
a set of images (the dictionary set) with a bank of Gabor filters. The jet is a vector of complex responses at a given
pixel, where each response follows the form:

J_j(\vec{x}) = \sum_{\vec{x}'} I(\vec{x}')\,\psi_j(\vec{x}-\vec{x}'), \qquad \psi_j(\vec{x}) = \frac{k_j^2}{\sigma^2}\exp\!\left(-\frac{k_j^2 x^2}{2\sigma^2}\right)\left[\exp(i\,\vec{k}_j\cdot\vec{x}) - \exp\!\left(-\frac{\sigma^2}{2}\right)\right]   (1)

where \psi_j(\vec{x}) is the Gabor-based kernel with the wave vector \vec{k}_j, which describes the variation of spatial frequencies
and orientations, represented by the indices \nu and \mu. Different values of \vec{k}_j are found using:

\vec{k}_j = \begin{pmatrix} k_\nu \cos\varphi_\mu \\ k_\nu \sin\varphi_\mu \end{pmatrix}, \qquad k_\nu = 2^{-\frac{\nu+2}{2}}\pi, \qquad \varphi_\mu = \frac{\mu\pi}{M}, \qquad j = \mu + M\nu   (2)

where L is the number of frequency levels (\nu = 0, \dots, L-1) and M is the number of orientations (\mu = 0, \dots, M-1). The
parameters were chosen based on empirical studies [16]; the width of the Gaussian envelope function is \sigma = 2.5. The jet
is a complex vector consisting of the L \times M filter responses and is defined as J = (J_1, \dots, J_{LM}). J is used to compute
the similarity of a target image and a bunch graph (obtained from dictionary images), whose node positions are annotated
a priori. In this paper, the objects of interest are hand postures. Thus, the classification of a given image as a gesture is
obtained by measuring the likelihood of two jets (one from the target image and one from the bunch graph). The similarity
function using the magnitudes and phases of the two jets to find a matching score between the target image and the bunch
graph is stated as follows:

S(J, J') = \frac{\sum_j a_j a'_j \cos(\phi_j - \phi'_j)}{\sqrt{\sum_j a_j^2 \sum_j a'^2_j}}   (3)

where a_j and \phi_j are obtained from J_j = a_j \exp(i\phi_j), and the jet J' is derived from the target image. The phase information
varies rapidly between contiguous pixels, thus providing an advantageous means to obtain a good initial estimate of
the position of the hand within the target image.
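To make Eqs. (1)-(3) concrete, the following sketch (in Python/NumPy, for illustration only) computes a Gabor jet at a pixel and the phase-sensitive jet similarity. The 33x33 kernel size and the choice of L = 3 frequency levels and M = 8 orientations are assumptions; only sigma = 2.5 is taken from the text.

import numpy as np

def gabor_kernel(nu, mu, size=33, sigma=2.5, n_orient=8):
    # Gabor-based kernel psi_j of Eqs. (1)-(2) for frequency level nu, orientation mu
    k = (2.0 ** (-(nu + 2) / 2.0)) * np.pi               # k_nu
    phi = mu * np.pi / n_orient                          # phi_mu
    kx, ky = k * np.cos(phi), k * np.sin(phi)            # wave vector k_j
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    envelope = (k ** 2 / sigma ** 2) * np.exp(-k ** 2 * (x ** 2 + y ** 2) / (2 * sigma ** 2))
    carrier = np.exp(1j * (kx * x + ky * y)) - np.exp(-sigma ** 2 / 2)   # DC-free wave
    return envelope * carrier

def jet_at(image, px, py, levels=3, orients=8):
    # complex jet J = (J_1, ..., J_{L*M}) at pixel (px, py); assumes the patch fits
    responses = []
    for nu in range(levels):
        for mu in range(orients):
            psi = gabor_kernel(nu, mu, n_orient=orients)
            half = psi.shape[0] // 2
            patch = image[py - half:py + half + 1, px - half:px + half + 1]
            responses.append(np.sum(patch * psi))        # one filter response
    return np.asarray(responses)

def jet_similarity(J, Jp):
    # phase-sensitive similarity of Eq. (3)
    a, ap = np.abs(J), np.abs(Jp)
    dphi = np.angle(J) - np.angle(Jp)
    return np.sum(a * ap * np.cos(dphi)) / np.sqrt(np.sum(a ** 2) * np.sum(ap ** 2))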
2.2. Elastic Bunch Graph Matching Procedure
The classification task is done by finding the position of the template which maximizes the similarity between
the bunch graph and the target image. The detailed Elastic Bunch Graph Matching (EBGM) procedure consists of
three steps [16]:
Approximately position the graph: The bunch graph is applied to the image and scanned in steps of 3 pixels in
both the x and y directions. All the nodes in each bunch graph are visited and compared; the similarity score of the
matching is given by a linear combination of the scores between the nodes in the bunch graph and the target
image.
Rescale the graph: The bunch graph can be resized by up to +20% and -20% (five scales are used) without
relative changes of the edge lengths.
Refine the position of each node: All nodes are allowed to vary +3 and -3 pixels from the position found in step
1. A penalty cost is introduced to prevent great distortion of the graph:

\text{Cost} = \sum_{e} c(\Delta d_e)   (4)

where c(\Delta d_e) is the cost of the difference of edge e before and after shifting the graph, relative to the original
length. Considering the distortions of the nodes, the total score of the matching becomes:

S' = S - \lambda \cdot \text{Cost}   (5)

where \lambda determines the extent of penalizing the solutions that depart from the original positions. In this paper,
the value of \lambda is chosen the same as in the state-of-the-art approach [16] in order to perform the comparison analysis.
During the overall matching process, the best fitting jet is selected according to the maximum similarity score in
Eq. (5) among the bunch graphs. The classification is determined by the maximum score over all the detectors
(Max-Wins rule [22]).
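As a rough sketch of step 1 and the penalized score of Eqs. (4)-(5), the loop below slides a rigid graph over the image in 3-pixel steps. The node_sim callback and the squared relative edge-length cost are placeholders, not the paper's exact implementation.

import numpy as np

def coarse_scan(image, node_offsets, node_sim, step=3):
    # step 1 of EBGM: slide the rigid graph and keep the best average node similarity
    # node_offsets: (N, 2) node positions relative to the graph origin
    # node_sim(image, x, y, n): similarity of node n's bunch jets at pixel (x, y)
    best_score, best_pos = -np.inf, (0, 0)
    h, w = image.shape
    for y0 in range(0, h, step):
        for x0 in range(0, w, step):
            xs, ys = node_offsets[:, 0] + x0, node_offsets[:, 1] + y0
            if xs.min() < 0 or ys.min() < 0 or xs.max() >= w or ys.max() >= h:
                continue                                 # graph would leave the image
            s = np.mean([node_sim(image, x, y, n) for n, (x, y) in enumerate(zip(xs, ys))])
            if s > best_score:
                best_score, best_pos = s, (x0, y0)
    return best_pos, best_score

def total_score(similarity, edges_before, edges_after, lam):
    # Eqs. (4)-(5): penalize edge-length changes relative to the original lengths
    cost = sum(((d1 - d0) ** 2) / (d0 ** 2) for d0, d1 in zip(edges_before, edges_after))
    return similarity - lam * cost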
2.3. Adaptive Boosting
In this paper we use boosting to assign weighted values to the nodes within the bunch graph to maximize the
recognition accuracy. These weights are in practice coefficients that maximize the discriminative function between
feature vectors that are retrieved from specific positions in the hand and negative observations.
Boosting [23,24,25] is a general machine learning technique used to design, train and test classifiers by
combining a series of weak classifiers to create a strong classifier. This technique was adopted in our posture
recognition algorithm to reflect the weight of nodes in the bunch graphs. In the boosting technique, a family of weak
classifiers forms an additive model:

F(v) = \sum_{m=1}^{M} f_m(v)   (6)

where f_m(v) denotes a weak detector, v is a feature vector, and M is the number of iterations (or number of weak
detectors) used to form the strong classifier F(v). When training, a weight is associated with each of the training samples,
which is updated in each iteration. The updates increase the weights of the samples which are misclassified at the
current iteration, and decrease the weights of those which were correctly classified. The weights w_i for
each training sample v_i with class label z_i \in \{-1, +1\} are defined so the cost of misclassification is minimized by adding a new
optimal weak classifier that meets:

f_m = \arg\min_{f} \sum_i w_i \exp\big(-z_i f(v_i)\big)   (7)

Upon choosing the weak classifier f_m and adding it to F(v), the estimates are updated: F(v) \leftarrow F(v) + f_m(v).
Accordingly, the weights over the samples are updated by:

w_i \leftarrow w_i \exp\big(-z_i f_m(v_i)\big)   (8)

In this paper, the gentleboost cost function [23] is used to minimize the error.
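The sketch below illustrates the additive model and weight update of Eqs. (6)-(8) in the gentleboost style [23], with regression stumps as weak detectors; the median threshold and the stump form are simplifications for illustration, not the paper's exact weak learners.

import numpy as np

def gentleboost(X, z, n_rounds=50):
    # X: (n, d) feature vectors; z: labels in {-1, +1}
    n, d = X.shape
    w = np.ones(n) / n                                   # uniform sample weights
    stumps = []
    for _ in range(n_rounds):
        best = None
        for j in range(d):                               # stump with least weighted error
            thr = np.median(X[:, j])                     # crude split for the sketch
            mask = X[:, j] > thr
            a = np.sum(w * z * mask) / max(np.sum(w * mask), 1e-12)    # side means
            b = np.sum(w * z * ~mask) / max(np.sum(w * ~mask), 1e-12)
            f = np.where(mask, a, b)                     # weak detector output f_m(v)
            err = np.sum(w * (z - f) ** 2)
            if best is None or err < best[0]:
                best = (err, j, thr, a, b, f)
        _, j, thr, a, b, f = best
        stumps.append((j, thr, a, b))
        w *= np.exp(-z * f)                              # Eq. (8): upweight mistakes
        w /= w.sum()
    return stumps

def strong_classifier(stumps, X):
    # Eq. (6): F(v) = sum_m f_m(v), thresholded at zero
    F = np.zeros(len(X))
    for j, thr, a, b in stumps:
        F += np.where(X[:, j] > thr, a, b)
    return np.sign(F)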
3. Hand Gesture Recognition Methodology
3.1. Node Annotation Techniques
The bunch graph was created by selecting a set of nodes for each image which belongs to the dictionary set. Each
node has to represent the same landmark in the hand in every image in the set. The process of selecting these nodes
is called “annotation”. Two types of nodes were annotated: edge nodes (nodes lying on the contour of the hand) and
inner nodes (nodes lying inside the contour). Three methods to accomplish the annotation task were compared in
this paper: manual, semi-automatic and automatic. Among these three methods, semi-automatic and automatic
approaches were proposed to compare with the standard manual annotation approach. The manual method consists
of selecting by hand every landmark in every image and ensuring that every landmark corresponds roughly to the
same point in all the images in the dictionary set. In the automatic method, the landmarks are automatically selected
by applying a Harris corner detector [26], which responds strongly to highly textured regions within the hand. The
semi-automatic approach is the same as the automatic approach, except that it allows the user to manually correct
those points that were detected automatically but had an offset with respect to visually identified landmarks. All the
methods rely on the fact that the contour in every image was annotated manually for precise alignment.
The difference among these three methods is the manner in which the nodes are selected within the hand region.
For the two latter methods (automatic and semi-automatic), one reference graph is chosen and the remaining five graphs
are aligned with respect to it. A linear assignment problem (LAP) is applied to find the points in each graph in the
bunch that best correspond to the points in the reference graph. The objective is to find the least-displacement
pairs of nodes from a larger set of candidates of the current graph. This is a minimization problem whose
formulation is provided in Eqs. (9) and (10):

\min_{x_{ij}} \sum_i \sum_j c_{ij}\, x_{ij}   (9)

\text{s.t.} \quad \sum_j x_{ij} = 1 \;\; \forall i, \qquad \sum_i x_{ij} \le 1 \;\; \forall j, \qquad x_{ij} \in \{0, 1\}   (10)

where c_{ij} = \|(x_i^r, y_i^r) - (x_j, y_j)\| is the Euclidean distance between the nodes, (x_i^r, y_i^r) is
the i-th node of the reference graph, and (x_j, y_j) is the j-th node of the graph to be matched. The detailed process is
summarized in the following algorithm.
Algorithm: Node Annotation
Input: edge nodes of the images from dictionary set B
for all I ∈ B do
    // take I as the reference graph, with outer (edge) nodes E_I
    P_I ← HarrisCornerDetector(I)
    for all J ∈ B, J ≠ I do
        // J is the graph to be aligned, with outer nodes E_J
        P_J ← HarrisCornerDetector(J)
        [t, r, s] ← Alignment(E_I, E_J)    // translation, rotation, and scaling
        P'_J ← PointTransformation(P_J, t, r, s)
        [A_J, c_J] ← LAP(P_I, P'_J)    // the linear assignment problem returns the assigned nodes and optimized costs
    end for
    C_I ← sum of the c_J    // total cost when I is the reference graph
end for
I* ← argmin_I(C_I)    // best reference graph with minimum total cost
return the optimal reference graph I* and save all the annotations
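A minimal sketch of the LAP of Eqs. (9)-(10), using SciPy's Hungarian-style solver; the coordinates in the usage example are made up for illustration.

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def match_nodes(ref_pts, cand_pts):
    # ref_pts: (n, 2) reference-graph nodes; cand_pts: (m, 2) candidates, m >= n
    cost = cdist(ref_pts, cand_pts)                      # c_ij = ||p_i_ref - p_j||
    rows, cols = linear_sum_assignment(cost)             # solves Eqs. (9)-(10)
    return list(zip(rows, cols)), cost[rows, cols].sum()

ref = np.array([[10., 12.], [40., 30.], [25., 50.]])     # hypothetical coordinates
cand = np.array([[11., 13.], [24., 48.], [41., 29.], [60., 5.]])
pairs, total_cost = match_nodes(ref, cand)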
The effectiveness of the proposed annotation methods is evaluated by three different metrics: (1) Costs entailed to
match the nodes. Relative displacements of the nodes with respect to each other in the different graphs result in a
matching ‘cost’. The Euclidean distances of each pair of nodes are summed up as the total matching costs. (2)
Transformation errors are those errors resulting from affine transformation disparities between the reference graph
and the ones aligned to it [27] (see Eq. (11) below). (3) Errors in recognition accuracy are those errors that can be
observed once the bunch graph is built and used to classify the postures in the testing stage.
e = \sum_i \left\| \mathbf{x}_i^{r} - \big( M\,\mathbf{x}_i + \mathbf{t} \big) \right\|^2   (11)

where M is the optimal rotation (R) and scaling (s) matrix (a least-squares minimization approach is used to reach the
optimum) applied to \mathbf{x}_i:

M = sR = s \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}   (12)

where \mathbf{x}_i = (x_i, y_i) is the vector representation of the coordinates of the points in each image, \mathbf{x}_i^{r} is the
corresponding point in the reference graph, and \mathbf{t} is the optimal translation parameter.
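The least-squares transformation of Eqs. (11)-(12) can be estimated as in the sketch below (a 2-D Umeyama-style closed form; the exact solver used in the paper is not specified).

import numpy as np

def similarity_transform(src, dst):
    # least-squares M = sR and t mapping src points onto dst points (Eqs. 11-12)
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    sc, dc = src - mu_s, dst - mu_d
    U, S, Vt = np.linalg.svd(dc.T @ sc)                  # SVD of the 2x2 covariance
    d = np.ones(2)
    if np.linalg.det(U @ Vt) < 0:
        d[-1] = -1.0                                     # forbid reflections
    R = U @ np.diag(d) @ Vt
    s = (S * d).sum() / (sc ** 2).sum()                  # least-squares scale
    M, t = s * R, mu_d - (s * R) @ mu_s
    err = np.sum(np.linalg.norm(dst - (src @ M.T + t), axis=1) ** 2)   # Eq. (11)
    return M, t, err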
The semi-automatic approach allows the user to manually correct those points that were detected automatically.
The correction is done by subjective observation, while the automatic method does not allow re-placing the nodes
once found. The tradeoff between the semi-automatic and automatic approaches is therefore the time saved against
a higher matching cost and transformation error, which in turn affect the recognition accuracy.
3.2. Weighting on Features
We propose to assign a weight to each node of the graph. The standard approach assumes that equal weights are
given to every node in the bunch graph when determining the similarity function for graph matching. However,
some features of the hand are more dominant than others, in terms of their discriminative power. Thus, the
importance (weight) over the nodes should be considered to reflect this attribute within the total similarity metric
S_w. The similarity metric is weighted by a coefficient vector \mathbf{w} that represents the discriminatory degree of each
node:

S_w = \sum_n w_n \, S\big(B_n, J(\mathbf{x}_n)\big)   (13)

where B is the bunch graph with node index n, and J(\mathbf{x}_n) is the jet computed from the target image taken at node
position \mathbf{x}_n. The adaptive boosting described in Section 2.3 is used to train a strong classifier to classify the observed
vectors based on the score S_w. The classifiers for the different hand postures are trained separately. Positive samples (true hits)
are created by extracting the feature vectors assigned to the nodes in the positive images. Negative samples are feature
vectors extracted by searching for the best matching location of a bunch graph in the negative set of images from the
training set (this method is broadly used to find negative instances that could potentially be recognized as true hits).
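A sketch of the weighted score of Eq. (13), reusing jet_similarity from the sketch in Section 2.1; bunch_jets and image_jets are hypothetical containers for the dictionary jets and the jets sampled from the target image.

def weighted_graph_similarity(weights, bunch_jets, image_jets):
    # Eq. (13): per-node similarity weighted by the boosted coefficients w_n
    total = 0.0
    for n, w_n in enumerate(weights):
        # best-fitting jet in the bunch for node n (max over the dictionary)
        s_n = max(jet_similarity(J, image_jets[n]) for J in bunch_jets[n])
        total += w_n * s_n
    return total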
Figure 1 shows the similarity response of a sample image when the similarity metric is computed with and without
weight assignment (the bunch graph is scanned over the entire image with an increment of 4 pixels).
Fig. 1. Similarity responses of bunch graph matched to an example image (a) with weight; (b) without weight
As can be seen, the similarity response when weight is used (the left image) is more 'focused' on a single point
than the response without weight (the right image). In other words, the similarity scores over the entire image exhibit
a clear global maximum when weight is applied. The more focused the response is, the fewer the local maxima, which
provides a more effective and reliable decision criterion.
Fig. 2. Weight distribution on an example image
Figure 2 shows the importance of the nodes represented by a heat map (the edges are omitted to emphasize the
nodes coloring system). Warm colors represent high weight, while cold ones represent low weight. As can be seen,
for those nodes with positions that blend with the background, lower weights are assigned (yellow color). On the
other hand, those nodes over the rim of the hand are assigned higher weights (warmer colors) since they are more
distinct from the background, and more descriptive of the hand.
3.3. Determining Dominance of Features
According to the results shown in Section 3.2, the ability to better discriminate features leads to a better
decision surface, enabling a more reliable classification. Furthermore, the fact that some features are
assigned lower weights indicates that their effect on the classification performance is low compared to those
features assigned higher weights. Thus, the computation of these features can be skipped without substantially
affecting the recognition accuracy. To explore the effect of the number of features selected on the algorithm,
the classification performance of three scenarios was studied, where in each the features (the nodes) were
selected in a different fashion (a sketch of the three policies follows the list):
1. Selection by weight: Sort the features by their assigned weights in descending order and select the N
highest-ranked features.
2. Selection by the magnitude of similarity: Sort the features by their magnitude of similarity score and select
the N highest-ranked features.
3. Random selection: Randomly order the features in a list, and select the first N features.
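A minimal sketch of the three selection policies, assuming per-node weights and similarity magnitudes are already available:

import numpy as np

def select_nodes(weights, sims, frac, policy="weight", rng=None):
    # return the indices of the top `frac` of nodes under the chosen policy
    n = len(weights)
    k = max(1, int(round(frac * n)))
    if policy == "weight":                               # policy 1: boosted weight
        order = np.argsort(weights)[::-1]
    elif policy == "similarity":                         # policy 2: similarity magnitude
        order = np.argsort(sims)[::-1]
    else:                                                # policy 3: random order
        order = (rng or np.random.default_rng()).permutation(n)
    return order[:k]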
4. Experimental Results
The proposed methods were validated with Triesch's hand posture dataset [28]. The dataset consists of 10
different hand gestures against complex, light and dark backgrounds, performed by 24 people. This results in a
total of 710 grayscale images of 128x128 pixels each. Each bunch graph was created by selecting two instances of a
given posture performed by three subjects against light and dark backgrounds (a total of six instances in each bunch
graph). This constitutes the dictionary used. The geometry of the nodes (their positions) on the bunch graph was
averaged from the six graphs. Overall, 60 images were used to create the bunch graphs. The remaining 650 images
were used for the training and testing datasets. The results presented correspond to the classification performance
among the extracted features from those 650 images. Examples showing the WEGM's detection performance are
shown in Figure 3. The corresponding bunch graphs were fitted to 10 hand postures. Each image was scanned in
increments of 4 pixels in the horizontal and vertical directions.
Fig. 3. 10 classes of sample hand gesture images after matching process
Colors from warm to cold were used to represent the nodes' weights. Light blue lines indicate
the edges. The edges were allowed to distort to reflect the variation of the gesture among images within the same
category. The green dots represent the annotated nodes.
Several RGB images were captured to test the WEGM detection algorithm. The images were resized to 128x128
pixels and the bunch graphs were scanned over the image by an increment of 4 pixels. The matching results of
several examples of three hand gestures’ images with light, dark, and complex backgrounds are shown in Figure 4.
Fig. 4. Example hand gestures RGB images after the matching process
4.1. Hand Gesture Classification
The Receiver Operating Characteristic (ROC) curves are presented in Figure 5. The curves were generated using
5-fold cross-validation for the 10 hand gestures. A true positive was determined based on whether the classification
score was greater than a given threshold (found empirically); otherwise it was regarded as a miss. When an
observation was classified as a certain gesture but was in fact a negative, the event was considered a false
alarm. Following this guideline, ROC curves were plotted to show the relationship between the true positives and
false alarms among the 10 classes, one curve for each hand gesture. The average recognition accuracy was 91.84%. This
value was found by averaging all the 10 recognition accuracies on the operational point (the point closest to the top
left corner of the graph).
Fig. 5. ROC curve for weight-based hand gesture recognition
The second metric to evaluate the hand gesture classification performance was the maximum score over the 10
classifiers (Max-Wins rule). This metric always assures a single detection (correct or incorrect), and no false
positive cases. If the maximum score fell on the incorrect class, that gesture was misclassified (it was considered a
confusion). The confusion matrix (see Figure 6) was calculated by comparing the scores delivered by each classifier
on a given sample image, and taking the maximum over all the classifiers. The average accuracy of correct
classification over the confusion matrix reached 97.08%. Both of these values show better performance than those
reported in the literature [16,21]. To show that the improvement is significant, a paired two-sample t-test for
equal means (650 observations for each method) was conducted on the classification results of WEGM and EGM
[16]. The one-tailed p-value (1.5665E-06 < .05) of the statistical test indicated that the improvement in
classification performance is statistically significant.
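The significance test can be reproduced along these lines; the per-image outcome vectors below are synthetic stand-ins for the actual 650 paired classification results.

import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
wegm = rng.binomial(1, 0.97, size=650)        # 1 = correct, 0 = misclassified (synthetic)
egm = rng.binomial(1, 0.92, size=650)

t_stat, p_two_sided = ttest_rel(wegm, egm)    # paired two-sample t-test for equal means
p_one_tailed = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2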
Fig. 6. Confusion Matrix for 10 Gestures.
[Figure 6: per-class accuracies (diagonal of the confusion matrix, %) — A: 95.38, B: 98.46, C: 96.92, D: 96.92, G: 98.46, H: 95.38, I: 95.38, L: 98.46, V: 100, Y: 95.38; individual confusions range from 1.54% to 3.08%.]
4.2. Facial Expression Classification
To illustrate the generality of the weighted-feature approach, the algorithm was tested on the MacBrain Face Stimulus
database [29]. The dataset consists of 16 emotions performed by 40 people. The dictionary includes 96 faces used to
build the bunch graphs. The nodes of each bunch graph are annotated on fiducial points such as the inner and
outer corners of the eyes, the inner and outer ends of the eyebrows, the tip of the nose, and the corners of the mouth. We used
480 images (30 faces per facial expression) to conduct 5-fold cross-validation. The confusion matrix (see Figure 7)
of classified emotions presents 96.88% accuracy of correct classification. Figure 8 shows the 16 facial expressions with
the weighted bunch graphs matched to them. The results show that the WEGM algorithm is also applicable to the
classification of other types of human features (facial expressions).
Fig. 7. Confusion Matrix for 16 Facial Expressions.
Fig. 8. Examples of 16 facial expression images used with matched bunch graphs.
[Figure 7: per-class accuracies (diagonal of the confusion matrix, %) — Angry_C: 96.67, Angry_O: 93.33, Calm_C: 96.67, Calm_O: 96.67, Disgust_C: 96.67, Disgust_O: 100, Fear_C: 100, Fear_O: 96.67, Happy_C: 100, Happy_O: 100, Happy_X: 100, Neutral_C: 90, Neutral_O: 100, Sad_C: 100, Sad_O: 93.33, Surprise_O: 90; individual confusions range from 3.33% to 6.67%.]
4.3. Weight-Based Feature Selection
Three different scenarios were studied to validate the effect of the number of features selected (and how they
were selected) on the classification accuracy. Figure 9 shows the recognition results when applying the three
different feature reduction policies (by weight, by magnitude of similarity, and randomly). Once the features/nodes were
sorted, only the top N percent of the sorted list was selected to determine the effects on recognition accuracy.
Nine cases were evaluated, from 100% (no reduction) down to 10% of the total number of features, in decrements of 10%.
The responses are presented in Figure 9. It can be seen that up to 70% of the nodes can be discarded without
reducing the recognition accuracy below 90% when the first selection policy was applied. The recognition accuracy
decreases at a slower pace than in the other two scenarios (selection by magnitude of similarity and random selection). The
worst results occurred when features were discarded randomly. When the second scenario was applied (the features
were selected by the sorted magnitude of similarity score), 50% of the nodes were required to assure 90%
recognition accuracy. It can be seen that in this scenario the overall performance was not as good as in the first
scenario, but still better than when the selection of nodes was random. Thus, the experimental results show that
using the WEGM method, the computation time can be reduced substantially (only 30% of the nodes need to be
computed) by discarding those nodes which do not have a significant effect on the overall recognition accuracy.
Fig. 9. Recognition Accuracy vs. Reduced Features.
4.4. Performance on Different Annotation Techniques
In this section the performance of each annotation technique used to create the bunch graph is discussed. In the
automatic and semi-automatic methods, candidate nodes were found in highly textured regions inside the hand. The
semi-automatic method allowed nodes to be adjusted manually after detection. The results displayed in Figure 10
illustrate the performance measures when using the three different methods to annotate the nodes used in the bunch
graphs. Three classifiers were trained using the three different annotation methods, and tested with light and dark
background images. When using the semi-automatic technique, the recognition error (7.88%) was lower than that of
the other two methods (9.07% and 10.74% for the manual and automatic methods, respectively). The normalized
matching cost was the highest for the automatic technique due to the inconsistency of the nodes' positions among
the graphs. For a similar reason, the normalized transformation error was also the
highest for the automatic technique. However, the matching costs and errors of the manual and semi-automatic
approaches were comparable, while the recognition error was slightly greater for the manual case. Although the matching
costs and errors of the semi-automatic method were slightly greater than those of the manual method, these
measures were substantially lower than those of the automatic method. Therefore, there is a trade-off between
recognition error and the speed of creating the annotation, the latter expressed by higher matching costs and
transformation errors. The proposed semi-automatic technique is an efficient annotation method for building the
bunch graph faster while maintaining an acceptable recognition accuracy.
Fig. 10. Performance Measures for different annotation techniques.
5. Conclusion
This research proposed an enhanced graph-based approach incorporating the concept of node weights (WEGM)
to recognize a lexicon of ten hand gestures. The WEGM algorithm was validated using a standard dataset of
postures against three different backgrounds: light, dark and complex. The WEGM algorithm classified the postures
with a recognition accuracy of 97.08% on average. This shows that introducing weights in the bunch graphs improves
the overall performance. The reason for this is that WEGM augments the discriminatory power of the nodes for
each gesture with respect to the remaining gestures. Furthermore, by computing the features of only the nodes with a
relatively high weight, and discarding the rest, the recognition performance is not significantly affected. Thus, the
WEGM approach improves the recognition performance while reducing the computational time required for
computing the features.
Additionally, semi-automatic and automatic annotation techniques were proposed which allow the flexible
selection of nodes that are consistent between images of the same posture. The semi-automatic approach was shown
to deliver the highest recognition accuracy (lowest recognition error), though not the lowest matching costs and
transformation errors, compared to the manual and automatic methods for constructing the bunch graphs.
Future work includes extending the WEGM algorithm to include depth information along with color. One simple
approach would be to use the range information to obtain a good initial region of interest for matching the WEGM
with the target image. This will result in a smaller search space and will reduce the overall computation time. In addition,
we are interested in experimenting with multimodal images (thermal, depth and color) and suggesting an efficient
method to combine these modalities to enhance overall performance. We plan to experiment with other features, such as
wavelets and HOGs, and to include larger and more complex datasets. Finally, we will develop a parallel
implementation of this algorithm for real-time detection and classification of hand gestures.
References
[1] R. Poppe, R. Rienks, Evaluating the Future of HCI: Challenges for the Evaluation of Emerging Applications, Proceedings of the International
conference on Artificial intelligence for human computing 4451 (2007) pp. 234-250.
[2] S. M. M. Roomi, R. J. Priya, H. Jayalakshmi, Hand Gesture Recognition for Human-Computer Interaction, Journal of Computer Science 6(9)
(2010) pp. 1002-1007.
[3] S. S. Rautaray, A. Agrawal, Interaction with virtual game through hand gesture recognition, International Conference on Multimedia, Signal
Processing and Communication Technologies (IMPACT) (Dec. 2011) pp. 244-247.
[4] Y.-J. Chang, S.-F. Chen, A.-F. Chuang, A gesture recognition system to transition autonomously through vocational tasks for individuals
with cognitive impairments, Research in Developmental Disabilities 32(6) (2011) pp. 2064-2068.
[5] T. Leyvand, C. Meekhof, Yi-Chen Wei, Jian Sun, Baining Guo, Kinect Identity: Technology and Experience, Computer, 44(4) (2011) pp. 94-
96.
[6] J. P. Wachs, H. I. Stern, Y. Edan, M. Gillam, J. Handler, C. Feied, M. Smith, A gesture-based tool for sterile browsing of radiology images,
Journal of the American Medical Informatics Association 15(3) (2008) pp. 321-323.
[7] K. Wood, C. E. Lathan, K. R. Kaufman, Development of an interactive upper extremity gestural robotic feedback system: From bench to
reality, Annual International Conference of the IEEE on Engineering in Medicine and Biology Society (EMBC) (Sept. 2009) pp. 5973-5976.
[8] S. Bilal, R. Akmeliawati, M. J. El Salami, A. A. Shafie, Vision-based hand posture detection and recognition for Sign Language - A study,
2011 4th International Conference On Mechatronics (ICOM) (May 2011) pp.1-6.
[9] M. de La Gorce, D. J. Fleet, N. Paragios, Model-Based 3D Hand Pose Estimation from Monocular Video, IEEE Transactions on Pattern
Analysis and Machine Intelligence 33(9) (2011) pp. 1793-1805.
[10] S. Koelstra, M. Pantic, I. Patras, A Dynamic Texture-Based Approach to Recognition of Facial Actions and Their Temporal Models, IEEE
Transactions on Pattern Analysis and Machine Intelligence 32(11) (Nov. 2010) pp. 1940-1954.
[11] Weiqi Yuan, Lantao Jing, Hand-Shape Feature Selection and Recognition Performance Analysis, 2011 International Conference on Hand
Based Biometrics (Nov. 2011) pp. 1-6.
[12] J. P. Wachs, M. Kölsch, H. Stern, Y. Edan, Vision-based hand-gesture applications, Communications of the ACM 54(2) (2011) pp. 60-71.
[13] L. Wiskott, J.-M. Fellous, N. Kruger, C. von der Malsburg, Face recognition by elastic bunch graph matching, International Conference on
Image Processing 1 (1997) pp.129-132.
[14] H.-C. Shin, S.-D. Kim, H.-C. Choi, Generalized elastic graph matching for face recognition, Pattern Recognition Letters 28(9) (2007) pp.
1077-1082.
[15] A. Tefas, A. Kotropoulos, I. Pitas, Face verification using elastic graph matching based on morphological signal decomposition, Signal
Processing 82(6) (2002) pp. 833-851.
[16] J. Triesch, C. von der Malsburg, Robust classification of hand postures against complex backgrounds, Proceedings of the Second
International Conference on Automatic Face and Gesture Recognition (Oct. 1996) pp.170-175.
[17] M. A. Turk, A.P. Pentland, Face recognition using eigenfaces, IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (Jun 1991) pp. 586-591.
[18] Jun Zhang, Yong Yan, M. Lades, Face recognition: eigenface, elastic matching, and neural nets, Proceedings of the IEEE 85(9) (Sept. 1997)
pp. 1423-1435.
[19] C. Kotropoulos, A. Tefas, I. Pitas, Frontal face authentication using morphological elastic graph matching, IEEE Transactions on Image
Processing 9(4) (Apr. 2000) pp. 555-560.
[20] Y.-T. Li, J. P. Wachs, Hierarchical Elastic Graph Matching for Hand Gesture Recognition, in: Progress in Pattern Recognition, Image Analysis,
Computer Vision, and Applications, L. Alvarez, M. Mejail, L. Gomez, and J. Jacobo, Eds., Springer Berlin Heidelberg (2012) pp. 308-315. doi:
10.1007/978-3-642-33275-3_38
[21] P. P. Kumar, P. Vadakkepat, Loh Ai Poh, Graph matching based hand posture recognition using neuro-biologically inspired features, 11th
International Conference on Control Automation Robotics & Vision (ICARCV) (Dec. 2010) pp. 1151-1156.
[22] J. H. Friedman, Another approach to polychotomous classification, Technical report, Stanford University Department of Statistics, 1996.
[23] J. Friedman, T. Hastie, R. Tibshirani, Additive logistic regression: a statistical view of boosting, Annals of Statistics, 28(2) (2000) pp. 337-
374.
[24] A. Torralba, K. P. Murphy, W. T. Freeman, Sharing features: efficient boosting procedures for multiclass object detection, Proceedings of
the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2 (July 2004) pp. II-762- II-769.
[25] A. Torralba, K. P. Murphy, W. T. Freeman, Sharing Visual Features for Multiclass and Multiview Object Detection, IEEE Transactions on
Pattern Analysis and Machine Intelligence 29(5) (May 2007) pp.854-869.
[26] C. Harris, M. Stephens, A combined corner and edge detector, Fourth Alvey Vision Conference, Manchester, UK (1988) pp. 147-151.
[27] M. Sonka, V. Hlavac, R. Boyle, Image Processing, Analysis, and Machine Vision, Thomas Engineering, Toronto Canada, 3rd edition, 2008.
[28] Sébastien Marcel hand posture and gesture datasets: Jochen Triesch static hand posture database. http://www.idiap.ch/resource/gestures/
[29] N. Tottenham, J. Tanaka, A. C. Leon, T. McCarry, M. Nurse, T. A. Marcus, A. Westerlund, B. J. Casey, C. A. Nelson, The NimStim set of
facial expressions: judgments from untrained research participants, Psychiatry Research 168(3) (2009) pp. 242-9.