Recognizing Hand Gestures using the Weighted Elastic Graph
Matching (WEGM) Method
Yu-Ting Li, Juan P. Wachs
Affiliation of the authors: Industrial Engineering, Purdue University, West Lafayette, IN 47906, U.S.A.
Correspondence and reprints: Juan P. Wachs, School of Industrial Engineering, Purdue University, West
Lafayette, IN 47907; Tel: (765) 496-7380; e-mail: [email protected]
Abstract
This paper proposes a weighted scheme for elastic graph matching hand posture recognition. Visual features scattered on the
elastic graph are assigned corresponding weights according to their relative ability to discriminate between gestures. The
weights’ values are determined using adaptive boosting. A dictionary representing the variability of each gesture class is
expressed in the form of a bunch graph. The positions of the nodes in the bunch graph are determined using three techniques:
manually, semi-automatically, and automatically. Experimental results show that the semi-automatic annotation method is
efficient and accurate in terms of three performance measures: assignment cost, accuracy, and transformation error. In terms of
recognition accuracy, our results show that weighting the features gives more significant discriminative power
than the classic method (uniform weighting). The weighted elastic graph matching (WEGM) approach was used to classify a
lexicon of ten hand postures, and it was found that the poses were recognized with a recognition accuracy of 97.08% on average.
Using the weighting scheme, computing cycles can be decreased by computing the features only for those nodes whose
weight is relatively high and ignoring the remaining nodes. It was found that only 30% of the nodes need to be computed to
obtain a recognition accuracy of over 90%.
Keywords: Elastic bunch graph, Graph matching, Feature weight, Hand gesture recognition
1. Introduction
One of the main goals in the human-computer interaction (HCI) field is the study of innovative ways to enhance
the user experience through natural communication and developing the technology that enables such interaction. In
this context, new trends include the development of a new generation of smaller, cheaper and versatile sensors [1,2].
Users’ subjective satisfaction favors those systems that provide an enhanced interaction experience based on the
naturalness and expressiveness that they offer [3]. Among the modalities relying on natural interaction, gestures
stand out as one of the main channels to interact with computers on many fronts, such as sign language
interpretation [2], assistive technologies [4], and game control applications [5], to mention a few. Gestures are also
being adopted in areas where touch can be a vehicle of infection transmission (e.g., browsing medical images in the
operating room) [6], or in outpatient clinics [7]. Common approaches for vision-based hand posture recognition
involve [8] (1) 3D model-based methods [9], (2) appearance-based models [10], and (3) shape analysis [11]. See [12]
for a detailed review on gesture-based interfaces.
1.1. Elastic Graphs
Elastic graph matching (EGM) is a technique used for object recognition [13], where the object is represented by
a labeled bunch graph. The bunch graph consists of a connected graph where the most salient features on the image
are represented as series of nodes. A bunch graph is built on a set of template images (also called ‘dictionary’). To
compare the similarity between one template image within the bunch and a target image, the graph obtained from
the template image is matched against the target image. Filter responses are computed at each node in the graph, and
a cost function is minimized based on a metric applied to the node assignment. Over the years, EGM was
implemented for tasks such as face recognition [13,14], face verification [15] and gesture recognition [16]. In
Wiskott et al. [13], EGM was used to recognize faces in images, where features were extracted from
typical face parts (e.g., the pupils, the beard, the nose, and the corners of the mouth). Triesch et al. employed EGM to
develop a classification approach for hand gestures against complex backgrounds [16]. EGM was also shown to
have better performance than eigenfaces [17], and auto-association and classification neural networks [18]. EGM
outperformed the aforementioned two methods due to its robustness to lighting variation, face position and
expression changes. Another variant of EGM is morphological elastic graph matching (MEGM) [19], which was
applied for frontal face authentication based on multi-scale dilation-erosion operations. One of the main drawbacks
of this method is the computational complexity involved in the detection and classification processes.
1.2. Motivation
One significant contribution of this paper is a procedure to establish the weights of the nodes in the graph, thus
validating the importance of weighting the features. We propose the weighted elastic graph matching (WEGM)
method for hand posture recognition. In our method, those features with a higher likelihood of appearing in the target
image receive a higher weight than those features which are less consistent with the graph model. Using weights
allows us to allocate more computational resources to those features that are more discriminative while ignoring
those features with lower importance [20]. Three metrics are used in the experiment to show that features with more
discriminative power dominate the recognition performance of the system. A secondary contribution is a
comparative study on efficient annotation techniques to create the bunch graphs.
The rest of the paper is organized as follows: in Section 2 the Elastic Bunch Graph Matching (EBGM) and
Adaptive Boosting algorithm are described. In Section 3 the proposed annotation methods and the weighted hand
gesture recognition algorithm (WEGM) are presented. Experimental results in Section 4 demonstrate the feasibility
and efficiency of the proposed techniques. Finally, the discussion and conclusions are presented in Section 5.
2. Fundamentals of Proposed Algorithms
2.1. Elastic Bunch Graph
The section below briefly describes the principles of the Elastic Bunch Graph; for more details see [16]. Bunch
graphs were used to represent and recognize hand postures [16,21] in grayscale images. Each bunch graph is a
collection of individual graphs representing a posture. Salient points on the underlying image are labeled as nodes of
a graph over the object. The links connecting the nodes express some topological metric, such as the Euclidean
distance. A Gabor jet is defined as the set of responses at specific locations in the images, obtained when convolving
a set of images (the dictionary set) with a bank of Gabor filters. The jet is a vector of complex responses at a given
pixel, where each response follows the form:

J_j(\vec{x}) = \sum_{\vec{x}'} I(\vec{x}')\,\psi_j(\vec{x}-\vec{x}'), \qquad \psi_j(\vec{x}) = \frac{k_j^2}{\sigma^2}\exp\!\left(-\frac{k_j^2 x^2}{2\sigma^2}\right)\left[\exp(i\,\vec{k}_j\cdot\vec{x}) - \exp\!\left(-\frac{\sigma^2}{2}\right)\right]   (1)

where \psi_j(\vec{x}) is the Gabor-based kernel with the wave vector \vec{k}_j, which describes the variation of spatial frequencies
and orientations, represented by the indices \nu and \mu. Different values of \vec{k}_j are found using:

\vec{k}_j = \begin{pmatrix} k_\nu \cos\varphi_\mu \\ k_\nu \sin\varphi_\mu \end{pmatrix}, \qquad k_\nu = 2^{-\frac{\nu+2}{2}}\pi, \qquad \varphi_\mu = \frac{\mu\pi}{M}, \qquad j = \mu + M\nu   (2)

where L is the number of frequency levels (\nu = 0, \dots, L-1) and M is the number of orientations (\mu = 0, \dots, M-1). The
parameters were chosen based on empirical studies [16]; the width of the Gaussian envelope function is \sigma = 2.5. The jet
is a complex vector consisting of the L \times M filter responses and is defined as J = (J_1, \dots, J_{LM}). J is used to compute
the similarity of a target image and a bunch graph (obtained from dictionary images), whose node positions are annotated
a priori. In this paper, the objects of interest are hand postures. Thus, the classification of a given image as a gesture is
obtained by measuring the likelihood of two jets (one from the target image and one from the bunch graph). The similarity
function using the magnitudes and phases of the two jets to find a matching score between the target image and the bunch
graph is stated as follows:

S(J, J') = \frac{\sum_j a_j a'_j \cos(\phi_j - \phi'_j)}{\sqrt{\sum_j a_j^2 \sum_j a'^2_j}}   (3)

where a_j and \phi_j are obtained from J_j = a_j \exp(i\phi_j), and the jet J' is derived from the target image. The phase information
varies rapidly between contiguous pixels, thus providing an advantageous means to obtain a good initial estimate of
the position of the hand within the target image.
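To make Eqs. (1)-(3) concrete, the following sketch (in Python/NumPy, for illustration only) computes a Gabor jet at a pixel and the phase-sensitive jet similarity. The 33x33 kernel size and the choice of L = 3 frequency levels and M = 8 orientations are assumptions; only sigma = 2.5 is taken from the text.

import numpy as np

def gabor_kernel(nu, mu, size=33, sigma=2.5, n_orient=8):
    # Gabor-based kernel psi_j of Eqs. (1)-(2) for frequency level nu, orientation mu
    k = (2.0 ** (-(nu + 2) / 2.0)) * np.pi               # k_nu
    phi = mu * np.pi / n_orient                          # phi_mu
    kx, ky = k * np.cos(phi), k * np.sin(phi)            # wave vector k_j
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    envelope = (k ** 2 / sigma ** 2) * np.exp(-k ** 2 * (x ** 2 + y ** 2) / (2 * sigma ** 2))
    carrier = np.exp(1j * (kx * x + ky * y)) - np.exp(-sigma ** 2 / 2)   # DC-free wave
    return envelope * carrier

def jet_at(image, px, py, levels=3, orients=8):
    # complex jet J = (J_1, ..., J_{L*M}) at pixel (px, py); assumes the patch fits
    responses = []
    for nu in range(levels):
        for mu in range(orients):
            psi = gabor_kernel(nu, mu, n_orient=orients)
            half = psi.shape[0] // 2
            patch = image[py - half:py + half + 1, px - half:px + half + 1]
            responses.append(np.sum(patch * psi))        # one filter response
    return np.asarray(responses)

def jet_similarity(J, Jp):
    # phase-sensitive similarity of Eq. (3)
    a, ap = np.abs(J), np.abs(Jp)
    dphi = np.angle(J) - np.angle(Jp)
    return np.sum(a * ap * np.cos(dphi)) / np.sqrt(np.sum(a ** 2) * np.sum(ap ** 2))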
2.2. Elastic Bunch Graph Matching Procedure
The classification task is done by finding the position of the template which maximizes the similarity between
the bunch graph and the target image. The detailed Elastic Bunch Graph Matching (EBGM) procedure consists of
three steps [16]:
Approximately position the graph: The bunch graph is applied to the image and scanned in steps of 3 pixels in
both the x and y directions. All the nodes in each bunch graph are visited and compared; the similarity score of the
matching is given by a linear combination of the scores between the nodes in the bunch graph and the target
image.
Rescale the graph: The bunch graph can be resized by up to +20% and -20% (five scales are used) without
relative changes of the edge lengths.
Refine the position of each node: All nodes are allowed to vary +3 and -3 pixels from the position found in step
1. A penalty cost is introduced to prevent great distortion of the graph:

\text{Cost} = \sum_{e} c(\Delta d_e)   (4)

where c(\Delta d_e) is the cost of the difference of edge e before and after shifting the graph, relative to the original
length. Considering the distortions of the nodes, the total score of the matching becomes:

S' = S - \lambda \cdot \text{Cost}   (5)

where \lambda determines the extent of penalizing the solutions that depart from the original positions. In this paper,
the value of \lambda is chosen the same as in the state-of-the-art approach [16] in order to perform the comparison analysis.
During the overall matching process, the best fitting jet is selected according to the maximum similarity score in
Eq. (5) among the bunch graphs. The classification is determined by the maximum score over all the detectors
(Max-Wins rule [22]).
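As a rough sketch of step 1 and the penalized score of Eqs. (4)-(5), the loop below slides a rigid graph over the image in 3-pixel steps. The node_sim callback and the squared relative edge-length cost are placeholders, not the paper's exact implementation.

import numpy as np

def coarse_scan(image, node_offsets, node_sim, step=3):
    # step 1 of EBGM: slide the rigid graph and keep the best average node similarity
    # node_offsets: (N, 2) node positions relative to the graph origin
    # node_sim(image, x, y, n): similarity of node n's bunch jets at pixel (x, y)
    best_score, best_pos = -np.inf, (0, 0)
    h, w = image.shape
    for y0 in range(0, h, step):
        for x0 in range(0, w, step):
            xs, ys = node_offsets[:, 0] + x0, node_offsets[:, 1] + y0
            if xs.min() < 0 or ys.min() < 0 or xs.max() >= w or ys.max() >= h:
                continue                                 # graph would leave the image
            s = np.mean([node_sim(image, x, y, n) for n, (x, y) in enumerate(zip(xs, ys))])
            if s > best_score:
                best_score, best_pos = s, (x0, y0)
    return best_pos, best_score

def total_score(similarity, edges_before, edges_after, lam):
    # Eqs. (4)-(5): penalize edge-length changes relative to the original lengths
    cost = sum(((d1 - d0) ** 2) / (d0 ** 2) for d0, d1 in zip(edges_before, edges_after))
    return similarity - lam * cost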
2.3. Adaptive Boosting
In this paper we use boosting to assign weighted values to the nodes within the bunch graph to maximize the
recognition accuracy. These weights are in practice coefficients that maximize the discriminative function between
feature vectors that are retrieved from specific positions in the hand and negative observations.
Boosting [23,24,25] is a general machine learning technique used to design, train and test classifiers by
combining a series of weak classifiers to create a strong classifier. This technique was adopted in our posture
recognition algorithm to reflect the weight of nodes in the bunch graphs. In the boosting technique, a family of weak
classifiers forms an additive model:

F(v) = \sum_{m=1}^{M} f_m(v)   (6)

where f_m(v) denotes a weak detector, v is a feature vector, and M is the number of iterations (or number of weak
detectors) used to form the strong classifier F(v). When training, a weight is associated with each of the training samples,
which is updated in each iteration. The updates increase the weights of the samples which are misclassified at the
current iteration, and decrease the weights of those which were correctly classified. The weights w_i for
each training sample v_i with class label z_i \in \{-1, +1\} are defined so the cost of misclassification is minimized by adding a new
optimal weak classifier that meets:

f_m = \arg\min_{f} \sum_i w_i \exp\big(-z_i f(v_i)\big)   (7)

Upon choosing the weak classifier f_m and adding it to F(v), the estimates are updated: F(v) \leftarrow F(v) + f_m(v).
Accordingly, the weights over the samples are updated by:

w_i \leftarrow w_i \exp\big(-z_i f_m(v_i)\big)   (8)

In this paper, the gentleboost cost function [23] is used to minimize the error.
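The sketch below illustrates the additive model and weight update of Eqs. (6)-(8) in the gentleboost style [23], with regression stumps as weak detectors; the median threshold and the stump form are simplifications for illustration, not the paper's exact weak learners.

import numpy as np

def gentleboost(X, z, n_rounds=50):
    # X: (n, d) feature vectors; z: labels in {-1, +1}
    n, d = X.shape
    w = np.ones(n) / n                                   # uniform sample weights
    stumps = []
    for _ in range(n_rounds):
        best = None
        for j in range(d):                               # stump with least weighted error
            thr = np.median(X[:, j])                     # crude split for the sketch
            mask = X[:, j] > thr
            a = np.sum(w * z * mask) / max(np.sum(w * mask), 1e-12)    # side means
            b = np.sum(w * z * ~mask) / max(np.sum(w * ~mask), 1e-12)
            f = np.where(mask, a, b)                     # weak detector output f_m(v)
            err = np.sum(w * (z - f) ** 2)
            if best is None or err < best[0]:
                best = (err, j, thr, a, b, f)
        _, j, thr, a, b, f = best
        stumps.append((j, thr, a, b))
        w *= np.exp(-z * f)                              # Eq. (8): upweight mistakes
        w /= w.sum()
    return stumps

def strong_classifier(stumps, X):
    # Eq. (6): F(v) = sum_m f_m(v), thresholded at zero
    F = np.zeros(len(X))
    for j, thr, a, b in stumps:
        F += np.where(X[:, j] > thr, a, b)
    return np.sign(F)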
3. Hand Gesture Recognition Methodology
3.1. Node Annotation Techniques
The bunch graph was created by selecting a set of nodes for each image which belongs to the dictionary set. Each
node has to represent the same landmark in the hand in every image in the set. The process of selecting these nodes
is called “annotation”. Two types of nodes were annotated: edge nodes (nodes lying on the contour of the hand) and
inner nodes (nodes lying inside the contour). Three methods to accomplish the annotation task were compared in
this paper: manual, semi-automatic and automatic. Among these three methods, semi-automatic and automatic
approaches were proposed to compare with the standard manual annotation approach. The manual method consists
of selecting by hand every landmark in every image and ensuring that every landmark corresponds roughly to the
same point in all the images in the dictionary set. In the automatic method, the landmarks are automatically selected
by applying a Harris corner detector [26], which responds strongly to highly textured regions within the hand. The
semi-automatic approach is the same as the automatic approach, except that it allows the user to manually correct
those points that were detected automatically but had an offset with respect to visually identified landmarks. All the
methods rely on the fact that the contour in every image was annotated manually for precise alignment.
The difference among these three methods is the manner in which the nodes are selected within the hand region.
For the two latter methods (automatic and semi-automatic), one reference graph is chosen and the remaining five graphs
are aligned with respect to it. A linear assignment problem (LAP) is applied to find the points in each graph in the
bunch that best correspond to the points in the reference graph. The objective is to find the least-displacement
pairs of nodes from a larger set of candidates of the current graph. This is a minimization problem whose
formulation is provided in Eqs. (9) and (10):

\min_{x_{ij}} \sum_i \sum_j c_{ij}\, x_{ij}   (9)

\text{s.t.} \quad \sum_j x_{ij} = 1 \;\; \forall i, \qquad \sum_i x_{ij} \le 1 \;\; \forall j, \qquad x_{ij} \in \{0, 1\}   (10)

where c_{ij} = \|(x_i^r, y_i^r) - (x_j, y_j)\| is the Euclidean distance between the nodes, (x_i^r, y_i^r) is
the i-th node of the reference graph, and (x_j, y_j) is the j-th node of the graph to be matched. The detailed process is
summarized in the following algorithm.
Algorithm: Node Annotation
Input: edge nodes of the images from dictionary set B
for all I ∈ B do
    // take I as the reference graph, with outer (edge) nodes E_I
    P_I ← HarrisCornerDetector(I)
    for all J ∈ B, J ≠ I do
        // J is the graph to be aligned, with outer nodes E_J
        P_J ← HarrisCornerDetector(J)
        [t, r, s] ← Alignment(E_I, E_J)    // translation, rotation, and scaling
        P'_J ← PointTransformation(P_J, t, r, s)
        [A_J, c_J] ← LAP(P_I, P'_J)    // the linear assignment problem returns the assigned nodes and optimized costs
    end for
    C_I ← sum of the c_J    // total cost when I is the reference graph
end for
I* ← argmin_I(C_I)    // best reference graph with minimum total cost
return the optimal reference graph I* and save all the annotations
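A minimal sketch of the LAP of Eqs. (9)-(10), using SciPy's Hungarian-style solver; the coordinates in the usage example are made up for illustration.

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def match_nodes(ref_pts, cand_pts):
    # ref_pts: (n, 2) reference-graph nodes; cand_pts: (m, 2) candidates, m >= n
    cost = cdist(ref_pts, cand_pts)                      # c_ij = ||p_i_ref - p_j||
    rows, cols = linear_sum_assignment(cost)             # solves Eqs. (9)-(10)
    return list(zip(rows, cols)), cost[rows, cols].sum()

ref = np.array([[10., 12.], [40., 30.], [25., 50.]])     # hypothetical coordinates
cand = np.array([[11., 13.], [24., 48.], [41., 29.], [60., 5.]])
pairs, total_cost = match_nodes(ref, cand)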
The effectiveness of the proposed annotation methods is evaluated by three different metrics: (1) Costs entailed to
match the nodes. Relative displacements of the nodes with respect to each other in the different graphs result in a
matching ‘cost’. The Euclidean distances of each pair of nodes are summed up as the total matching costs. (2)
Transformation errors are those errors resulting from affine transformation disparities between the reference graph
and the ones aligned to it [27] (see Eq. (11) below). (3) Errors in recognition accuracy are those errors that can be
observed once the bunch graph is built and used to classify the postures in the testing stage.
e = \sum_i \left\| \mathbf{x}_i^{r} - \big( M\,\mathbf{x}_i + \mathbf{t} \big) \right\|^2   (11)

where M is the optimal rotation (R) and scaling (s) matrix (a least-squares minimization approach is used to reach the
optimum) applied to \mathbf{x}_i:

M = sR = s \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}   (12)

where \mathbf{x}_i = (x_i, y_i) is the vector representation of the coordinates of the points in each image, \mathbf{x}_i^{r} is the
corresponding point in the reference graph, and \mathbf{t} is the optimal translation parameter.
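The least-squares transformation of Eqs. (11)-(12) can be estimated as in the sketch below (a 2-D Umeyama-style closed form; the exact solver used in the paper is not specified).

import numpy as np

def similarity_transform(src, dst):
    # least-squares M = sR and t mapping src points onto dst points (Eqs. 11-12)
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    sc, dc = src - mu_s, dst - mu_d
    U, S, Vt = np.linalg.svd(dc.T @ sc)                  # SVD of the 2x2 covariance
    d = np.ones(2)
    if np.linalg.det(U @ Vt) < 0:
        d[-1] = -1.0                                     # forbid reflections
    R = U @ np.diag(d) @ Vt
    s = (S * d).sum() / (sc ** 2).sum()                  # least-squares scale
    M, t = s * R, mu_d - (s * R) @ mu_s
    err = np.sum(np.linalg.norm(dst - (src @ M.T + t), axis=1) ** 2)   # Eq. (11)
    return M, t, err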
The semi-automatic approach allows the user to manually correct those points that were detected automatically.
The correction is done by subjective observation, while the automatic method does not allow re-placing the nodes
once found. The tradeoff between the semi-automatic and automatic approaches is therefore the time saved against
a higher matching cost and transformation error, which in turn affect the recognition accuracy.
3.2. Weighting on Features
We propose to assign a weight to each node of the graph. The standard approach assumes that equal weights are
given to every node in the bunch graph when determining the similarity function for graph matching. However,
some features of the hand are more dominant than others, in terms of their discriminative power. Thus, the
importance (weight) over the nodes should be considered to reflect this attribute within the total similarity metric
S_w. The similarity metric is weighted by a coefficient vector \mathbf{w} that represents the discriminatory degree of each
node:

S_w = \sum_n w_n \, S\big(B_n, J(\mathbf{x}_n)\big)   (13)

where B is the bunch graph with node index n, and J(\mathbf{x}_n) is the jet computed from the target image taken at node
position \mathbf{x}_n. The adaptive boosting described in Section 2.3 is used to train a strong classifier to classify the observed
vectors based on the score S_w. The classifiers for the different hand postures are trained separately. Positive samples (true hits)
are created by extracting the feature vectors assigned to the nodes in the positive images. Negative samples are feature
vectors extracted by searching for the best matching location of a bunch graph in the negative set of images from the
training set (this method is broadly used to find negative instances that could potentially be recognized as true hits).
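A sketch of the weighted score of Eq. (13), reusing jet_similarity from the sketch in Section 2.1; bunch_jets and image_jets are hypothetical containers for the dictionary jets and the jets sampled from the target image.

def weighted_graph_similarity(weights, bunch_jets, image_jets):
    # Eq. (13): per-node similarity weighted by the boosted coefficients w_n
    total = 0.0
    for n, w_n in enumerate(weights):
        # best-fitting jet in the bunch for node n (max over the dictionary)
        s_n = max(jet_similarity(J, image_jets[n]) for J in bunch_jets[n])
        total += w_n * s_n
    return total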
Figure 1 shows the similarity response of a sample image when the similarity metric is computed with and without
weight assignment (the bunch graph is scanned over the entire image with an increment of 4 pixels).
Fig. 1. Similarity responses of bunch graph matched to an example image (a) with weight; (b) without weight
As can be seen, the similarity response when weight is used (the left image) is more 'focused' on a single point
than the response without weight (the right image). In other words, the similarity scores over the entire image exhibit
a clear global maximum when weight is applied. The more focused the response is, the fewer the local maxima, which
provides a more effective and reliable decision criterion.
Fig. 2. Weight distribution on an example image
Figure 2 shows the importance of the nodes represented by a heat map (the edges are omitted to emphasize the
nodes coloring system). Warm colors represent high weight, while cold ones represent low weight. As can be seen,
for those nodes with positions that blend with the background, lower weights are assigned (yellow color). On the
other hand, those nodes over the rim of the hand are assigned higher weights (warmer colors) since they are more
distinct from the background, and more descriptive of the hand.
3.3. Determining Dominance of Features
According to the results shown in Section 3.2, the ability to better discriminate features leads to a better
decision surface, enabling a more reliable classification. Furthermore, the fact that some features are
assigned lower weights indicates that their effect on the classification performance is low compared to those
features assigned higher weights. Thus, the computation of these features can be skipped without substantially
affecting the recognition accuracy. To explore the effect of the number of features selected on the algorithm,
the classification performance of three scenarios was studied, where in each the features (the nodes) were
selected in a different fashion (a sketch of the three policies follows the list):
1. Selection by weight: Sort the features by their assigned weights in descending order and select the N
highest-ranked features.
2. Selection by the magnitude of similarity: Sort the features by their magnitude of similarity score and select
the N highest-ranked features.
3. Random selection: Randomly order the features in a list, and select the first N features.
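A minimal sketch of the three selection policies, assuming per-node weights and similarity magnitudes are already available:

import numpy as np

def select_nodes(weights, sims, frac, policy="weight", rng=None):
    # return the indices of the top `frac` of nodes under the chosen policy
    n = len(weights)
    k = max(1, int(round(frac * n)))
    if policy == "weight":                               # policy 1: boosted weight
        order = np.argsort(weights)[::-1]
    elif policy == "similarity":                         # policy 2: similarity magnitude
        order = np.argsort(sims)[::-1]
    else:                                                # policy 3: random order
        order = (rng or np.random.default_rng()).permutation(n)
    return order[:k]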
4. Experimental Results
The proposed methods were validated with Triesch's hand posture dataset [28]. The dataset consists of 10
different hand gestures against complex, light and dark backgrounds, performed by 24 people. This results in a
total of 710 grayscale images of 128x128 pixels each. Each bunch graph was created by selecting two instances of a
given posture performed by three subjects against light and dark backgrounds (a total of six instances in each bunch
graph). This constitutes the dictionary used. The geometry of the nodes (their positions) on the bunch graph was
averaged from the six graphs. Overall, 60 images were used to create the bunch graphs. The remaining 650 images
were used for the training and testing datasets. The results presented correspond to the classification performance
among the extracted features from those 650 images. Examples showing the WEGM's detection performance are
shown in Figure 3. The corresponding bunch graphs were fitted to 10 hand postures. Each image was scanned in
increments of 4 pixels in the horizontal and vertical directions.
Fig. 3. 10 classes of sample hand gesture images after matching process
Colors from warm to cold were used to represent the nodes' weights. Light blue lines indicate
the edges. The edges were allowed to distort to reflect the variation of the gesture among images within the same
category. The green dots represent the annotated nodes.
Several RGB images were captured to test the WEGM detection algorithm. The images were resized to 128x128
pixels and the bunch graphs were scanned over the image by an increment of 4 pixels. The matching results of
several examples of three hand gestures’ images with light, dark, and complex backgrounds are shown in Figure 4.
Fig. 4. Example hand gestures RGB images after the matching process
4.1. Hand Gesture Classification
The Receiver Operating Characteristic (ROC) curves are presented in Figure 5. The curves were generated using
5-fold cross-validation for the 10 hand gestures. A true positive was determined based on whether the classification
score was greater than a given threshold (found empirically); otherwise it was regarded as a miss. When an
observation was classified as a certain gesture but was in fact a negative, the event was considered a false
alarm. Following this guideline, ROC curves were plotted to show the relationship between the true positives and
false alarms among the 10 classes, one curve for each hand gesture. The average recognition accuracy was 91.84%. This
value was found by averaging all the 10 recognition accuracies on the operational point (the point closest to the top
left corner of the graph).
Fig. 5. ROC curve for weight-based hand gesture recognition
The second metric to evaluate the hand gesture classification performance was the maximum score over the 10
classifiers (Max-Wins rule). This metric always assures a single detection (correct or incorrect), and no false
positive cases. If the maximum score fell on the incorrect class, that gesture was misclassified (it was considered a
confusion). The confusion matrix (see Figure 6) was calculated by comparing the scores delivered by each classifier
on a given sample image, and taking the maximum over all the classifiers. The average accuracy of correct
classification over the confusion matrix reached 97.08%. Both of these values show better performance than those
reported in the literature [16,21]. To show that the improvement is significant, a paired two-sample t-test for
equal means (650 observations for each method) was conducted on the classification results of WEGM and EGM
[16]. The one-tailed p-value (1.5665E-06 < .05) of the statistical test indicated that the improvement in
classification performance is statistically significant.
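The significance test can be reproduced along these lines; the per-image outcome vectors below are synthetic stand-ins for the actual 650 paired classification results.

import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
wegm = rng.binomial(1, 0.97, size=650)        # 1 = correct, 0 = misclassified (synthetic)
egm = rng.binomial(1, 0.92, size=650)

t_stat, p_two_sided = ttest_rel(wegm, egm)    # paired two-sample t-test for equal means
p_one_tailed = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2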
Fig. 6. Confusion Matrix for 10 Gestures.
[Figure 6: per-class accuracies (diagonal of the confusion matrix, %) — A: 95.38, B: 98.46, C: 96.92, D: 96.92, G: 98.46, H: 95.38, I: 95.38, L: 98.46, V: 100, Y: 95.38; individual confusions range from 1.54% to 3.08%.]
4.2. Facial Expression Classification
To illustrate the generality of the weighted-feature approach, the algorithm was tested on the MacBrain Face Stimulus
database [29]. The dataset consists of 16 emotions performed by 40 people. The dictionary includes 96 faces used to
build the bunch graphs. The nodes of each bunch graph are annotated on fiducial points such as the inner and
outer corners of the eyes, the inner and outer ends of the eyebrows, the tip of the nose, and the corners of the mouth. We used
480 images (30 faces per facial expression) to conduct 5-fold cross-validation. The confusion matrix (see Figure 7)
of classified emotions presents 96.88% accuracy of correct classification. Figure 8 shows the 16 facial expressions with
the weighted bunch graphs matched to them. The results show that the WEGM algorithm is also applicable to the
classification of other types of human features (facial expressions).
Fig. 7. Confusion Matrix for 16 Facial Expressions.
Fig. 8. Examples of 16 facial expression images used with matched bunch graphs.
[Figure 7: per-class accuracies (diagonal of the confusion matrix, %) — Angry_C: 96.67, Angry_O: 93.33, Calm_C: 96.67, Calm_O: 96.67, Disgust_C: 96.67, Disgust_O: 100, Fear_C: 100, Fear_O: 96.67, Happy_C: 100, Happy_O: 100, Happy_X: 100, Neutral_C: 90, Neutral_O: 100, Sad_C: 100, Sad_O: 93.33, Surprise_O: 90; individual confusions range from 3.33% to 6.67%.]
4.3. Weight-Based Feature Selection
Three different scenarios were studied to validate the effect of the number of features selected (and how they
were selected) on the classification accuracy. Figure 9 shows the recognition results when applying the three
different feature reduction policies (by weight, by magnitude of similarity, and randomly). Once the features/nodes were
sorted, only the top N percent of the sorted list was selected to determine the effects on recognition accuracy.
Nine cases were evaluated, from 100% (no reduction) down to 10% of the total number of features, in decrements of 10%.
The responses are presented in Figure 9. It can be seen that up to 70% of the nodes can be discarded without
reducing the recognition accuracy below 90% when the first selection policy was applied. The recognition accuracy
decreases at a slower pace than in the other two scenarios (selection by magnitude of similarity and random selection). The
worst results occurred when features were discarded randomly. When the second scenario was applied (the features
were selected by the sorted magnitude of similarity score), 50% of the nodes were required to assure 90%
recognition accuracy. It can be seen that in this scenario the overall performance was not as good as in the first
scenario, but still better than when the selection of nodes was random. Thus, the experimental results show that
using the WEGM method, the computation time can be reduced substantially (only 30% of the nodes need to be
computed) by discarding those nodes which do not have a significant effect on the overall recognition accuracy.
Fig. 9. Recognition Accuracy vs. Reduced Features.
4.4. Performance on Different Annotation Techniques
In this section the performance of each annotation technique used to create the bunch graph is discussed. In the
automatic and semi-automatic methods, candidate nodes were found in highly textured regions inside the hand. The
semi-automatic method allowed nodes to be adjusted manually after detection. The results displayed in Figure 10
illustrate the performance measures when using the three different methods to annotate the nodes used in the bunch
graphs. Three classifiers were trained using the three different annotation methods, and tested with light and dark
background images. When using the semi-automatic technique, the recognition error (7.88%) was lower than that of
the other two methods (9.07% and 10.74% for the manual and automatic methods, respectively). The normalized
matching cost was the highest for the automatic technique due to the inconsistency of the nodes' positions among
the graphs. For a similar reason, the normalized transformation error was also the
highest for the automatic technique. However, the matching costs and errors of the manual and semi-automatic
approaches were comparable, while the recognition error was slightly greater for the manual case. Although the matching
costs and errors of the semi-automatic method were slightly greater than those of the manual method, these
measures were substantially lower than those of the automatic method. Therefore, there is a trade-off between
recognition error and the speed of creating the annotation, the latter expressed by higher matching costs and
transformation errors. The proposed semi-automatic technique is an efficient annotation method for building the
bunch graph faster while maintaining an acceptable recognition accuracy.
Fig. 10. Performance Measures for different annotation techniques.
5. Conclusion
This research proposed an enhanced graph-based approach incorporating the concept of node weights (WEGM)
to recognize a lexicon of ten hand gestures. The WEGM algorithm was validated using a standard dataset of
postures against three different backgrounds: light, dark and complex. The WEGM algorithm classified the postures
with a recognition accuracy of 97.08% on average. This shows that introducing weights in the bunch graphs improves
the overall performance. The reason for this is that WEGM augments the discriminatory power of the nodes for
each gesture with respect to the remaining gestures. Furthermore, by computing the features of only the nodes with a
relatively high weight, and discarding the rest, the recognition performance is not significantly affected. Thus, the
WEGM approach improves the recognition performance while reducing the computational time required for
computing the features.
Additionally, semi-automatic and automatic annotation techniques were proposed which allow the flexible
selection of nodes that are consistent between images of the same posture. The semi-automatic approach was shown
to deliver the highest recognition accuracy (lowest recognition error), though not the lowest matching costs and
transformation errors, compared to the manual and automatic methods for constructing the bunch graphs.
Future work includes extending the WEGM algorithm to include depth information along with color. One simple
approach would be to use the range information to obtain a good initial region of interest for matching the WEGM
with the target image. This will result in a smaller search space and will reduce the overall computation time. In addition,
we are interested in experimenting with multimodal images (thermal, depth and color) and suggesting an efficient
method to combine these modalities to enhance overall performance. We plan to experiment with other features, such as
wavelets and HOGs, and to include larger and more complex datasets. Finally, we will develop a parallel
implementation of this algorithm for real-time detection and classification of hand gestures.
References
[1] R. Poppe, R. Rienks, Evaluating the Future of HCI: Challenges for the Evaluation of Emerging Applications, Proceedings of the International
conference on Artificial intelligence for human computing 4451 (2007) pp. 234-250.
[2] S. M. M. Roomi, R. J. Priya, H. Jayalakshmi, Hand Gesture Recognition for Human-Computer Interaction, Journal of Computer Science 6(9)
(2010) pp. 1002-1007.
[3] S. S. Rautaray, A. Agrawal, Interaction with virtual game through hand gesture recognition, International Conference on Multimedia, Signal
Processing and Communication Technologies (IMPACT) (Dec. 2011) pp. 244-247.
[4] Y.-J. Chang, S.-F. Chen, A.-F. Chuang, A gesture recognition system to transition autonomously through vocational tasks for individuals
with cognitive impairments, Research in Developmental Disabilities 32(6) (2011) pp. 2064-2068.
[5] T. Leyvand, C. Meekhof, Yi-Chen Wei, Jian Sun, Baining Guo, Kinect Identity: Technology and Experience, Computer, 44(4) (2011) pp. 94-
96.
[6] J. P. Wachs, H. I. Stern, Y. Edan, M. Gillam, J. Handler, C. Feied, M. Smith, A gesture-based tool for sterile browsing of radiology images,
Journal of the American Medical Informatics Association 15(3) (2008) pp. 321-323.
[7] K. Wood, C. E. Lathan, K. R. Kaufman, Development of an interactive upper extremity gestural robotic feedback system: From bench to
reality, Annual International Conference of the IEEE on Engineering in Medicine and Biology Society (EMBC) (Sept. 2009) pp. 5973-5976.
[8] S. Bilal, R. Akmeliawati, M. J. El Salami, A. A. Shafie, Vision-based hand posture detection and recognition for Sign Language - A study,
2011 4th International Conference On Mechatronics (ICOM) (May 2011) pp.1-6.
[9] M. de La Gorce, D. J. Fleet, N. Paragios, Model-Based 3D Hand Pose Estimation from Monocular Video, IEEE Transactions on Pattern
Analysis and Machine Intelligence 33(9) (2011) pp. 1793-1805.
[10] S. Koelstra, M. Pantic, I. Patras, A Dynamic Texture-Based Approach to Recognition of Facial Actions and Their Temporal Models, IEEE
Transactions on Pattern Analysis and Machine Intelligence 32(11) (Nov. 2010) pp. 1940-1954.
[11] Weiqi Yuan, Lantao Jing, Hand-Shape Feature Selection and Recognition Performance Analysis, 2011 International Conference on Hand
Based Biometrics (Nov. 2011) pp. 1-6.
[12] J. P. Wachs, M. Kölsch, H. Stern, Y. Edan, Vision-based hand-gesture applications, Communications of the ACM 54(2) (2011) pp. 60-71.
[13] L. Wiskott, J.-M. Fellous, N. Kruger, C. von der Malsburg, Face recognition by elastic bunch graph matching, International Conference on
Image Processing 1 (1997) pp.129-132.
[14] H.-C. Shin, S.-D. Kim, H.-C. Choi, Generalized elastic graph matching for face recognition, Pattern Recognition Letters 28(9) (2007) pp.
1077-1082.
[15] A. Tefas, A. Kotropoulos, I. Pitas, Face verification using elastic graph matching based on morphological signal decomposition, Signal
Processing 82(6) (2002) pp. 833-851.
[16] J. Triesch, C. von der Malsburg, Robust classification of hand postures against complex backgrounds, Proceedings of the Second
International Conference on Automatic Face and Gesture Recognition (Oct. 1996) pp.170-175.
[17] M. A. Turk, A.P. Pentland, Face recognition using eigenfaces, IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (Jun 1991) pp. 586-591.
[18] Jun Zhang, Yong Yan, M. Lades, Face recognition: eigenface, elastic matching, and neural nets, Proceedings of the IEEE 85(9) (Sept. 1997)
pp. 1423-1435.
[19] C. Kotropoulos, A. Tefas, I. Pitas, Frontal face authentication using morphological elastic graph matching, IEEE Transactions on Image
Processing 9(4) (Apr. 2000) pp. 555-560.
[20] Y.-T. Li, J. P. Wachs, Hierarchical Elastic Graph Matching for Hand Gesture Recognition, in: Progress in Pattern Recognition, Image Analysis,
Computer Vision, and Applications, L. Alvarez, M. Mejail, L. Gomez, and J. Jacobo, Eds., Springer Berlin Heidelberg (2012) pp. 308-315. doi:
10.1007/978-3-642-33275-3_38
[21] P. P. Kumar, P. Vadakkepat, Loh Ai Poh, Graph matching based hand posture recognition using neuro-biologically inspired features, 11th
International Conference on Control Automation Robotics & Vision (ICARCV) (Dec. 2010) pp. 1151-1156.
[22] J. H. Friedman, Another approach to polychotomous classification, Technical report, Stanford University Department of Statistics, 1996.
[23] J. Friedman, T. Hastie, R. Tibshirani, Additive logistic regression: a statistical view of boosting, Annals of Statistics, 28(2) (2000) pp. 337-
374.
[24] A. Torralba, K. P. Murphy, W. T. Freeman, Sharing features: efficient boosting procedures for multiclass object detection, Proceedings of
the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2 (July 2004) pp. II-762- II-769.
[25] A. Torralba, K. P. Murphy, W. T. Freeman, Sharing Visual Features for Multiclass and Multiview Object Detection, IEEE Transactions on
Pattern Analysis and Machine Intelligence 29(5) (May 2007) pp.854-869.
[26] C. Harris, M. Stephens, A combined corner and edge detector, Fourth Alvey Vision Conference, Manchester, UK (1988) pp. 147-151.
[27] M. Sonka, V. Hlavac, R. Boyle, Image Processing, Analysis, and Machine Vision, Thomas Engineering, Toronto Canada, 3rd edition, 2008.
[28] Sébastien Marcel hand posture and gesture datasets: Jochen Triesch static hand posture database. http://www.idiap.ch/resource/gestures/
[29] N. Tottenham, J. Tanaka, A. C. Leon, T. McCarry, M. Nurse, T. A. Marcus, A. Westerlund, B. J. Casey, C. A. Nelson, The NimStim set of
facial expressions: judgments from untrained research participants, Psychiatry Research 168(3) (2009) pp. 242-9.