Tentap: a piano-playing gesture recognition system based on ten
fingers for virtual piano
Kyeongeun Seo
Korea University
Sejong City, South Korea
Hyeonjoong Cho *
Korea University
Sejong City, South Korea
ABSTRACT
We propose a system that recognizes 32 gestures, covering all possible tap combinations, and is applicable to a head-mounted display (HMD) for playing a virtual piano with an RGB-D camera. While several existing hand interaction algorithms have introduced mid-air interaction using a sensor installed in front of the user, our system is designed to recognize hand interaction with a planar object using a sensor installed over the user's head. It detects the location of the hands and recognizes tap-down and tap-up gestures on the planar object. The proposed system consists of three procedures: hand detection, hand pose estimation, and gesture classification. In particular, the hand pose estimation is performed with a 3D Convolutional Neural Network (3DCNN) that uses both temporal and spatial information, and the gesture classification is performed with Support Vector Machine (SVM) classifiers on normalized 3D hand positions that are invariant to scale, viewpoint, and hand orientation. To train and validate the system, we collected 240K samples, each consisting of a depth image from an RGB-D camera and a hand pose from an optical motion capture system. Preliminary results show that our method achieves a hand pose estimation error of about 10 mm and a gesture classification accuracy of 83%, about a 10% improvement over a state-of-the-art method.
CCS CONCEPTS
• Human-centered computing → Human computer interaction
(HCI); Interaction techniques
ADDITIONAL KEYWORDS AND PHRASES
Hand Pose Estimation, Gesture Classification, Tap Detection, Human-Computer Interaction
1 INTRODUCTION
In recent years, a significant amount of research has addressed estimating hand poses and recognizing hand gestures using consumer depth sensors [1,7]. We found that most of this work assumes that the hands are in mid-air. This assumption diminishes its utility for virtual-instrument applications for two reasons [5]: users tire quickly, and there is no tactile feedback.
* Corresponding author
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
i3D 2018 Posters, May 15–18, 2018, Montreal, Canada
© 2018 Copyright held by the owner/author(s).
Figure 1: Experimental setup. (a) Front view of the environment, (b) top view of the environment, (c) a Kinect v2 depth camera, (d) an IR camera, (e) a hand with six attached IR tags.
To overcome these drawbacks, Barehanded Music was developed to let users rest their hands on a planar object and tap their fingers on it [5]. That system, however, supported only six simple tap gestures.
We propose a system that recognizes 32 gestures for interacting with a planar object to play a virtual piano, using a sensor installed over the user's head as shown in Fig. 1. Our setup is easily applicable to an HMD, i.e., a depth sensor mounted on the headset. We recognize 32 gestures consisting of a no-tap gesture and 31 tap gestures covering all possible tap-down combinations of five fingers, enough to provide piano chords. This system poses three significant challenges. First, a fingertip is heavily occluded at the moment a finger bends to tap on the planar object, because the sensor is installed above the user's head rather than above the hand. Second, it is hard to recognize gestures with a naive threshold method that uses only spatial information, because its classification performance depends far more on input accuracy than a method that uses both spatial and temporal information. Third, there is no publicly released dataset of labeled hand gesture sequences with depth images and 3D hand joint positions with which to train and validate our algorithm.
To tackle these problems, we propose a system, Tentap, consisting of three procedures: hand detection, hand pose estimation, and gesture classification. In the hand detection, we detect the region of a hand with traditional image processing techniques. In the hand pose estimation, we estimate a hand pose expressed by the 3D positions of the five fingertips and the wrist. The hand pose is estimated by a trained 3DCNN [4] that takes a series of preceding images, the last of which is the current image. In the gesture classification, we obtain a series of 3D hand positions to capture both spatial and temporal information, normalize them to extract invariant positions, and train SVM classifiers on the normalized data. To train and validate our system, we collected a dataset of 240K samples consisting of real depth images and high-quality hand pose annotations, captured with a Kinect v2 and an OptiTrack system with five infrared (IR) cameras and six IR tags.
Figure 2: 3DCNN structure for hand pose estimation.
2 TENTAP SYSTEM
Our system takes a stream of depth images and produces three outputs: a translation offset, a hand pose, and a gesture class. The translation offset and the hand pose together give the absolute 3D position of the hand where it rests on the planar object. The gesture class is one of the 32 gestures. To extract these outputs, we sequentially perform three procedures.
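To make the data flow concrete, the following minimal Python sketch chains the three procedures per frame. The helper bodies are placeholders and every name is illustrative; the paper does not release code.

import numpy as np

# Placeholder stand-ins for the three procedures (Sec. 2); the real
# components are the hand detector, the 3DCNN, and the SVM bank.
def detect_hand(depth):            # -> cropped hand image, wrist position (offset)
    return depth[100:200, 100:200], np.array([0.0, 0.0, 0.6])

def estimate_pose_3dcnn(frames):   # -> 18 values: 6 joints x (x, y, z), wrist-relative
    return np.zeros(18)

def classify_gesture(frames):      # -> one of the 32 gesture classes
    return 0

history = []
def tentap_step(depth_frame):
    """One pipeline step: hand detection -> pose estimation -> classification."""
    crop, wrist = detect_hand(depth_frame)
    history.append(crop)
    rel_pose = estimate_pose_3dcnn(history[-20:])        # last 20 segmented frames
    abs_pose = rel_pose.reshape(6, 3) + wrist            # re-anchor at the wrist offset
    gesture = classify_gesture(history[-20:])
    return abs_pose, gesture

pose, g = tentap_step(np.zeros((424, 512), np.float32))  # Kinect v2 resolution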
In the hand detection, we obtain a segmented depth image as input for the hand pose estimation, together with a translation offset, which is the position of the user's wrist; we use the wrist because it is a stable landmark. First, we perform pre-processing steps on a depth image as follows: (1) building a background model, (2) subtracting the background from the depth image using that model, (3) binarizing the background-removed image, and (4) finding contours in the binarized image. Second, we find a rectangle by contour approximation and crop the depth image to its coordinates. Third, we obtain the translation offset by comparing widths along the contours. We scan the widths from the end of the user's arm toward the hand, because the width just above the wrist is noticeably larger than that of the arm.
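A minimal OpenCV sketch of these steps follows (assuming OpenCV >= 4). The binarization threshold, the scan direction for the width comparison, and all function names are illustrative assumptions, not values from the paper.

import cv2
import numpy as np

def detect_hand(depth, background):
    # (1)-(2) subtract the pre-built background model from the depth image
    fg = cv2.absdiff(depth, background).astype(np.float32)
    # (3) binarize the background-removed image (50 mm is an assumed threshold)
    _, mask = cv2.threshold(fg, 50, 255, cv2.THRESH_BINARY)
    mask = mask.astype(np.uint8)
    # (4) find contours of the binarized image and keep the largest one
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None, None
    hand = max(contours, key=cv2.contourArea)
    # crop the depth image with the approximated bounding rectangle
    x, y, w, h = cv2.boundingRect(hand)
    crop = depth[y:y + h, x:x + w]
    # translation offset: scan widths from the arm toward the hand and take the
    # row with the largest width jump, since the palm is wider than the forearm
    widths = (mask[y:y + h, x:x + w] > 0).sum(axis=1).astype(int)
    wrist_row = int(np.argmax(np.diff(widths)))
    return crop, (x, y + wrist_row)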
In the hand pose estimation, we infer 3D positions from a set of 20 segmented images with a 3DCNN [4], as shown in Fig. 2. We use residual connections for faster and easier optimization, and max pooling layers to reduce dimensionality and provide some translation invariance. The FC1 and FC2 layers are used for dimension reduction and for the 18 outputs, respectively. To train the 3DCNN, we pre-process both input and output. For the input, we resize the segmented images to 48×48 resolution and normalize the depth pixel values to [0, 1] by mean normalization. For the output, we use relative 3D positions, obtained by subtracting the wrist position from each 3D position. In training, we optimize the network parameters by minimizing a loss defined as the difference between the expected and the estimated 3D positions. We set the learning rate to 0.0005 and use a momentum of 0.9. The batch size is 128 and the number of epochs is 65. We apply two regularization techniques, weight decay (r = 0.5%) and dropout (p = 0.5), to reduce overfitting.
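A PyTorch sketch of this stage is given below. The layer sizes follow the Figure 2 labels and the stated hyper-parameters; the padding, strides, activations, and the use of mean squared error for the "difference" loss are our assumptions.

import torch
import torch.nn as nn

class Tentap3DCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # channel counts and feature-map sizes follow the Figure 2 labels
        self.conv1 = nn.Conv3d(1, 4, 5, padding=2)    # -> 4 x 20 x 48 x 48
        self.conv2 = nn.Conv3d(4, 4, 5, padding=2)
        self.pool1 = nn.MaxPool3d((1, 3, 3), stride=2, padding=(0, 1, 1))
        self.conv3 = nn.Conv3d(4, 16, 3, padding=1)   # -> 16 x 10 x 24 x 24
        self.conv4 = nn.Conv3d(16, 16, 3, padding=1)
        self.pool2 = nn.MaxPool3d((1, 3, 3), stride=2, padding=(0, 1, 1))
        self.conv5 = nn.Conv3d(16, 32, 3, padding=1)  # -> 32 x 5 x 12 x 12
        self.conv6 = nn.Conv3d(32, 32, 3, padding=1)
        self.pool3 = nn.MaxPool3d((1, 3, 3), stride=2, padding=(0, 1, 1))
        self.drop = nn.Dropout(0.5)                   # dropout, p = 0.5
        self.fc1 = nn.Linear(32 * 3 * 6 * 6, 512)     # FC1: dimension reduction
        self.fc2 = nn.Linear(512, 18)                 # FC2: 6 joints x (x, y, z)
        self.relu = nn.ReLU()

    def forward(self, x):                             # x: (N, 1, 20, 48, 48)
        x = self.relu(self.conv1(x))
        x = self.pool1(x + self.relu(self.conv2(x)))  # residual connection
        x = self.relu(self.conv3(x))
        x = self.pool2(x + self.relu(self.conv4(x)))  # residual connection
        x = self.relu(self.conv5(x))
        x = self.pool3(x + self.relu(self.conv6(x)))  # residual connection
        x = self.drop(x.flatten(1))
        return self.fc2(self.relu(self.fc1(x)))

model = Tentap3DCNN()
# stated hyper-parameters: lr 0.0005, momentum 0.9, weight decay 0.5%
opt = torch.optim.SGD(model.parameters(), lr=5e-4, momentum=0.9, weight_decay=5e-3)
loss = nn.MSELoss()(model(torch.zeros(2, 1, 20, 48, 48)), torch.zeros(2, 18))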
In the gesture classification, we recognize the 32 gestures from the five classes produced by five SVM classifiers. For training and classification, a set of 20 consecutive 3D poses is collected, and bundles of five consecutive poses are averaged. The averaged 3D positions are normalized with the Skeleton Quad descriptor described in [2]. Before training, we apply the SMOTE+Tomek technique to the training set to avoid the class imbalance problem.
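A sketch of this stage using scikit-learn and imbalanced-learn follows. The simple translation/scale normalization below is a simplified stand-in for the full Skeleton Quad descriptor of [2], and the assumption that each SVM predicts one finger's tap-down bit (so five binary outputs give 2^5 = 32 gestures) is our reading of the paper.

import numpy as np
from sklearn.svm import SVC
from imblearn.combine import SMOTETomek

def features(window):
    """window: (20, 6, 3), 20 consecutive wrist-relative hand poses."""
    bundles = window.reshape(4, 5, 6, 3).mean(axis=1)      # average bundles of 5
    bundles = bundles - bundles[:, :1]                     # translate: joint 0 (wrist) at origin
    scale = np.linalg.norm(bundles, axis=-1).max() or 1.0  # crude scale invariance
    return (bundles / scale).reshape(-1)                   # 72-dim feature vector

def train_finger_svms(X, Y):
    """X: (n, 72) features, Y: (n, 5) per-finger tap-down bits."""
    svms = []
    for f in range(5):
        # SMOTE+Tomek resampling against class imbalance, as in the paper
        Xb, yb = SMOTETomek().fit_resample(X, Y[:, f])
        svms.append(SVC(kernel="rbf").fit(Xb, yb))
    return svms

def predict_gesture(svms, window):
    x = features(window)[None, :]
    bits = [int(s.predict(x)[0]) for s in svms]
    return sum(b << i for i, b in enumerate(bits))         # 0 = no-tap, 1..31 = taps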
Figure 3: Average hand pose prediction accuracy at different distance thresholds.
3 EXPERIMENT
We set up an environment to collect data as shown in Fig. 1. The resolution of the Kinect v2 is 512 × 424 at 56–62 FPS. An OptiTrack system produced accurate 3D positions of the IR tags attached to a hand from the IR cameras. Five subjects performed 25 sessions, each covering all gestures with both hands. We required subjects to move their hands side to side, to reduce or increase the spacing between fingers, and to tap quickly or slowly on an unmarked plane. We captured 240K depth images, including 6.3K images of tap gestures. A tap gesture is defined as the moment after a finger moves down and finally reaches the planar object (within a distance threshold). We found the image of that moment and labeled it and its previous nine images as a tap gesture.
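The labeling rule can be written down as a short sketch; the contact threshold below is an assumed value, since the poster does not state one.

import numpy as np

def label_taps(height_mm, touch_mm=5.0, span=10):
    """height_mm: per-frame fingertip height above the plane. Label the frame
    where a fingertip first reaches the plane and its previous nine frames
    (ten in total) as one tap gesture."""
    labels = np.zeros(len(height_mm), dtype=bool)
    for t in range(1, len(height_mm)):
        if height_mm[t] < touch_mm <= height_mm[t - 1]:    # moment of contact
            labels[max(0, t - span + 1): t + 1] = True
    return labels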
To train and validate our algorithms, we performed hold-out cross-validation with a 70/30% train/test split of our dataset. We compared our 3DCNN to a state-of-the-art architecture, the Basic network [3], and obtained an average error of 10 mm. Fig. 3 shows the performance under two evaluation metrics described in [6]. To evaluate our gesture classification algorithm, we implemented Barehanded Music [5]. The accuracy of our algorithm was 83%, a 10% improvement over Barehanded Music (75%). The accuracies of 27 of the 32 gestures exceeded 80%.
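For reference, a sketch of the two metrics of [6] as we read them: the average joint error, and the curve in Figure 3, i.e., the fraction of frames whose maximum joint error stays below a distance threshold. The 6-joint layout matches the paper; the threshold grid is an assumption.

import numpy as np

def pose_metrics(pred, gt, thresholds=range(10, 51, 10)):
    """pred, gt: (n_frames, 6, 3) joint positions in mm."""
    err = np.linalg.norm(pred - gt, axis=-1)               # per-joint errors, (n_frames, 6)
    avg_error = float(err.mean())                          # reported as ~10 mm in Sec. 3
    curve = {t: float((err.max(axis=1) < t).mean()) for t in thresholds}
    return avg_error, curve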
4 CONCLUSION
We presented a system, Tentap, that estimates 3D hand positions and detects 32 gestures for a virtual piano. We collected a new dataset for training and validating our algorithms, and our system outperformed existing methods on both hand pose estimation and gesture classification in our setup. In the future, we will consider a piano's physical action and the processing time to provide a more realistic virtual piano.
ACKNOWLEDGEMENTS
This work was supported by a grant from the Ministry of Trade, Industry and Energy (10085608, 2017).
REFERENCES
[1] De Smedt, Q., Wannous, H., & Vandeborre, J. P. 2016. Skeleton-based dynamic hand gesture recognition. In CVPRW 2016, pp. 1206-1214.
[2] Evangelidis, G., Singh, G., & Horaud, R. 2014. Skeletal quads: Human action recognition using joint quadruples. In ICPR 2014, pp. 4513-4518.
[3] Guo, H., Wang, G., Chen, X., Zhang, C., Qiao, F., & Yang, H. 2017. Region ensemble network: Improving convolutional network for hand pose estimation. arXiv preprint arXiv:1702.02447.
[4] Ji, S., Xu, W., Yang, M., & Yu, K. 2013. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[5] Liang, H., Wang, J., Sun, Q., Liu, Y. J., Yuan, J., Luo, J., & He, Y. 2016. Barehanded music: Real-time hand interaction for virtual piano. In I3D 2016. ACM.
[6] Oberweger, M., Wohlhart, P., & Lepetit, V. 2015. Hands deep in deep learning for hand pose estimation. arXiv preprint arXiv:1502.06807.
[7] Yi, X., Yu, C., Zhang, M., Gao, S., Sun, K., & Shi, Y. 2015. ATK: Enabling ten-finger freehand typing in air based on 3D hand tracking data. In UIST 2015. ACM.
[Figure 2 layer labels, as recovered: conv1 5×5×5 → 20×48×48×4; conv2 5×5×5 → 20×48×48×4; max pool1 1×3×3 → 10×24×24×4; conv3 3×3×3 → 10×24×24×16; conv4 3×3×3 → 10×24×24×16; max pool2 1×3×3 → 5×12×12×16; conv5 3×3×3 → 5×12×12×32; conv6 3×3×3 → 5×12×12×32; max pool3 1×3×3 → 3×6×6×32; FC1 → 512; FC2 → 18; output: 18 values; "+" marks denote residual connections.]
[Figure 3 axes: x = distance threshold / mm (0–50); y = fraction of frames within distance / % (0–100); series: Basic resnet, Ours.]