Thai sign language translation using Scale Invariant Feature Transform and Hidden Markov Models


Pattern Recognition Letters 34 (2013) 1291–1298



Sansanee Auephanwiriyakul a,b,*, Suwannee Phitakwinai a, Wattanapong Suttapak c, Phonkrit Chanda a, Nipon Theera-Umpon b,d

a Department of Computer Engineering, Faculty of Engineering, Chiang Mai University, Chiang Mai 50200, Thailand
b Biomedical Engineering Center, Chiang Mai University, Chiang Mai 50200, Thailand
c Software Engineering, School of Information and Communication Technology, University of Phayao, Phayao 56000, Thailand
d Department of Electrical Engineering, Faculty of Engineering, Chiang Mai University, Chiang Mai 50200, Thailand


Article history: Received 6 September 2012; Available online 28 April 2013.

Communicated by A. Petrosino

Keywords: Scale Invariant Feature Transform; Hidden Markov Models; Sign language; Hand gesture recognition

0167-8655/$ - see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.patrec.2013.04.017

This work was supported by the Telecommunications Research and Industrial Development Institute under the contracts No. TRIDI RS. 2551/003 and TRIDI LAB 2551/001.

* Corresponding author at: Department of Computer Engineering, Faculty of Engineering, Chiang Mai University, Chiang Mai 50200, Thailand. Tel.: +66 5394 2024; fax: +66 5394 2072. E-mail address: [email protected] (S. Auephanwiriyakul).

Abstract

Visual communication is important for a deaf and/or mute person. It is also one of the tools for the communication between humans and machines. In this paper, we develop an automatic Thai sign language translation system that is able to translate sign language that is not finger-spelling sign language. In particular, we utilize the Scale Invariant Feature Transform (SIFT) to match a test frame with observation symbols from keypoint descriptors collected in the signature library. These keypoint descriptors are computed from several keyframes that are recorded at different times of day for several days from five subjects. Hidden Markov Models (HMMs) are then used to translate observation sequences to words. We also collect Thai sign language videos from 20 subjects for testing. The system achieves approximately 86–95% on average for the signer-dependent case, 79.75% on average for the signer-semi-independent case (same subjects used in the HMM training only), and 76.56% on average for the signer-independent case. These results are from the constrained system in which each signer wears a shirt with long sleeves in front of a dark background. The unconstrained system, in which each signer does not wear a long-sleeve shirt and stands in front of various natural backgrounds, yields a good result of around 74% on average in the signer-independent experiment. The important feature of the proposed system is the consideration of the shapes and positions of the fingers, in addition to hand information. This feature gives the system the ability to recognize hand sign words that have similar gestures.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

There are several forms of communication among people. One of them is visual communication, e.g., hand gestures, body language, etc. This type of communication is a useful tool to improve the quality of life for deaf or non-vocal persons. Besides being a means of communication between humans, visual communication has become one of the important communication channels between humans and machines.

Deaf or non-vocal persons communicate using sign language or hand gestures. However, there is a barrier between hearing and deaf persons because most hearing persons cannot understand the language. Hence, the problem of recognizing a sign language becomes an interesting and popular research area. There are several approaches (Kramer and Leifer, 1989; Fels and Hinton, 1993; Su et al., 1996; Chen et al., 2002; Wu et al., 2000; Gao et al., 2000; Wu et al., 1998; Gao et al., 2004; Ibarguren et al., 2012; Oz and Leu, 2011) that have been proposed in this research area. However, these approaches are based on the idea that the hand gesture data are collected through a cyber-glove or position trackers attached to the body of the subject. Color gloves with a particular pattern were also used to cope with the feature point tracking problem (Wang et al., 2007). Moreover, the color of the background and the signers' clothes must differ from that of the gloves. Although these approaches provide good recognition rates, it is more natural to recognize hand gestures without using special hardware. There are a few studies that propose techniques for recognizing static hand gestures captured as images or dynamic hand gestures captured as video (Min et al., 1997; Huang et al., 1998; Kobayashi and Haruyama, 1997; Chen et al., 2003; Alon et al., 2005; Auephanwiriyakul and Chaisatian, 2004; Lee and Tsai, 2009; Al-Rouson et al., 2009; Sriboonruang et al., 2004; Yang and Sarkar, 2009; Just and Marcel, 2009; Stergiopoulou and Papamarkos, 2009; Shen et al., 2012; Prisacariu and Reid, 2012; Han et al., 2009; Kelly et al., 2012; Gamage et al., 2011).


However, these methods need to either automatically or manually segment the hand regions before performing any recognition. Although our previous work (Phitakwinai et al., 2008) does not need any segmentation, its correct classification rate is not very good, i.e., 60–80% for signer-dependent and 20–40% for signer-independent recognition.

In this paper, we develop a Thai sign language translation system that is able to translate sign language that is not finger-spelling sign language. In particular, we utilize the Scale Invariant Feature Transform (SIFT) to match each test frame with observation symbols from keypoint descriptors collected in the signature library. These keypoint descriptors are computed from several keyframes that are recorded at different times of day for several days. Hidden Markov Models (HMMs) are then used to translate observation sequences to words. There are 10 models since there are 10 Thai words in the experiment, i.e., "elder", "grandfather", "grandmother", "gratitude", "female", "male", "glad", "thank you", "understand", and "miss". We report our performance based on the correct translation from the signer-dependent, signer-semi-independent, and signer-independent experiments. Signer-dependent means that the subject is used both in collecting the signature library for the SIFT algorithm and in training the HMMs, whereas signer-independent refers to the blind test data set. There are two types of signer-semi-independent subjects. The first type is when the subject is used in collecting the signature library only. The second is when the subject is used in training the HMMs only. The aforementioned experiments are performed on signers who are asked to wear a black shirt with long sleeves and stand in front of a dark background. The best system is then applied to signers in the blind test data set without any constraints, i.e., they are asked to wear a short-sleeve shirt and stand in front of various natural backgrounds.

2. System description

Although this system recognizes only 10 words of the Thai sign language (Office of the Basic Education Commission, 1997), there are actually 31 hand gestures (shown in Fig. 1(a)). From the hand gestures in Fig. 1(a), we can see that many of them are very similar and, therefore, the hand gestures for the words in our data set are also similar. To recognize these words, not only the hand's information but also the fingers' shapes and positions are required. We collect these hand gestures from five subjects who are asked to wear a black shirt with long sleeves and stand in front of a dark background. We record each Thai sign several times to build our signature library in the form of video files. Representative frames (Rframes) of each video file for each Thai sign are then selected manually. The region of interest in each frame (the hand part only) is cropped manually to a size of 190 × 190. We call each image in the signature library a keyframe. There are 730 keyframes from each subject, hence there are 3650 keyframes in total. The numbers of keyframes from each subject in the signature library are also shown in Fig. 1(a).

After we collect all keyframes in the signature library, we compute the keypoint descriptors of each keyframe using the Scale Invariant Feature Transform (SIFT). We keep all keypoint descriptors in the signature library database. In order to translate Thai sign language, we need a system that grabs an image through a camera and analyzes it. Fig. 1(b) shows the structure of the translation system. The features are obtained from the luminance (Y) channel (Jiang et al., 1998) extracted from a video sequence. The image size from a video sequence is 720 × 576 pixels. Each image sequence is decimated so that only 14 image frames are kept for each video file. Then a matched keyframe is found with the SIFT. These matched keyframes represent the observation symbols used in the left–right HMMs shown in Fig. 1(c). This observation sequence is later translated by the HMMs.
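As an illustration of this front end, the sketch below decimates a video to 14 evenly spaced frames and keeps only the luminance channel. It is a minimal sketch using OpenCV and NumPy; the file name, the even spacing of the frames, and the conversion to Y via the YCrCb color space are our own illustrative assumptions rather than details taken from the paper.

```python
import cv2
import numpy as np

def decimate_to_luminance(video_path, n_frames=14):
    """Pick n_frames evenly spaced frames from a video and return their
    luminance (Y) channels as a list of 2-D uint8 arrays."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices over the whole clip (assumption).
    indices = np.linspace(0, total - 1, n_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, bgr = cap.read()
        if not ok:
            break
        # Y channel of the YCrCb representation is the luminance.
        y = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)[:, :, 0]
        frames.append(y)
    cap.release()
    return frames

# Example usage (hypothetical file name):
# frames = decimate_to_luminance("sign_clip.avi", n_frames=14)
```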

Now let us briefly review the Scale Invariant Feature Transform (SIFT) algorithm (Lowe, 2004). SIFT finds specific locations, called keypoints, in the scale space of a training image. There are four major stages used to generate a keypoint, i.e., scale-space extrema detection, keypoint localization, orientation assignment, and keypoint description. In scale-space extrema detection, candidate keypoint locations are determined by constructing a Gaussian pyramid from the original image. Each image in the scale space is subtracted from its nearby scales to produce the difference-of-Gaussian images. Each sample point is compared with its eight neighbors in the same scale and its nine neighbors in each of the scales above and below. If the pixel is greater than or less than all 26 neighbors, that location is an extremum and is selected as a keypoint location. In keypoint localization, a detailed fit to the nearby data for location, scale, and ratio of principal curvatures is performed. This information allows candidate keypoints that have low contrast or are poorly localized along an edge to be rejected. In orientation assignment, orientations are assigned to each keypoint location based on the local image gradient directions. An orientation histogram is built from the gradient directions of sample points around the keypoint. The magnitude of each gradient is added to the nearest orientation bin of the histogram.
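The 26-neighbor extremum test can be written compactly; the snippet below is a simplified sketch of that test on a stack of difference-of-Gaussian images, ignoring the sub-pixel refinement and the contrast/edge rejection described above. The function and array names are ours, not the paper's.

```python
import numpy as np

def is_scale_space_extremum(dog, s, r, c):
    """dog: 3-D array (scales, rows, cols) of difference-of-Gaussian images.
    Return True if dog[s, r, c] is strictly greater or strictly smaller than
    all 26 neighbors (8 in its own scale, 9 in the scale above and below)."""
    cube = dog[s - 1:s + 2, r - 1:r + 2, c - 1:c + 2]  # 3x3x3 neighborhood
    center = dog[s, r, c]
    others = np.delete(cube.ravel(), 13)               # drop the center value
    return bool(np.all(center > others) or np.all(center < others))
```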

When the largest bin of the histogram is found, we assign that orientation to the keypoint. Then the area around the keypoint is rotated by the negative of the angle determined from the maximum bin index. After being oriented to the zero angle, the keypoint is described. A keypoint descriptor is created by computing the gradient magnitude at each pixel in a region around the keypoint location. These values are then added to the nearest bins of the 16 orientation histograms around the keypoint location. Each of the 16 histograms has eight bins, resulting in a 128-dimensional feature vector. After we normalize the feature vector, the vector is added to the database for the future matching process. Fig. 2(a) and (b) show the process of computing a keypoint descriptor, while Fig. 2(c)–(e) show examples of keypoint descriptors of three keyframes.
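To make the 4 × 4 × 8 = 128 layout concrete, here is a toy construction of such a descriptor from a 16 × 16 patch of gradient magnitudes and orientations that has already been rotated to the dominant orientation. It deliberately omits the Gaussian weighting, trilinear interpolation, and clipping used in the full SIFT descriptor; all names are illustrative.

```python
import numpy as np

def toy_sift_descriptor(mag, ori):
    """mag, ori: 16x16 arrays of gradient magnitude and orientation (radians),
    sampled around a keypoint and rotated to its dominant orientation.
    Returns a normalized 128-D vector: 4x4 spatial cells, 8 orientation bins."""
    hist = np.zeros((4, 4, 8))
    bins = np.floor((ori % (2 * np.pi)) / (2 * np.pi) * 8).astype(int) % 8
    for r in range(16):
        for c in range(16):
            hist[r // 4, c // 4, bins[r, c]] += mag[r, c]
    desc = hist.ravel()
    return desc / (np.linalg.norm(desc) + 1e-12)   # unit-length descriptor
```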

Matching is performed by comparing the Euclidean distances from the current keypoint descriptor to its two nearest neighbor keypoint descriptors in the signature library database. If the ratio of the smallest distance to the second smallest one is less than a given threshold, then the two keypoints match. Fig. 2(f)–(h) show the matching process between keypoint descriptors in the signature library database and the test images.
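This is the standard ratio test, and it can be reproduced with OpenCV's SIFT implementation roughly as below. The default threshold value and the variable names are placeholders; in the experiments the paper varies its SIFT threshold over 0.65, 0.7, and 0.75.

```python
import cv2

def ratio_test_matches(des_test, des_library, ratio=0.7):
    """Return matches whose nearest/second-nearest distance ratio is below
    the given threshold (Lowe's ratio test)."""
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(des_test, des_library, k=2)
    return [pair[0] for pair in knn
            if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance]

# The descriptors could come from OpenCV's SIFT implementation, e.g.:
# sift = cv2.SIFT_create()
# _, des_test = sift.detectAndCompute(test_frame, None)
# _, des_library = sift.detectAndCompute(keyframe, None)
```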

Now, let us briefly review Hidden Markov Models (HMMs) (Rabiner, 1989; Rabiner and Juang, 1993). A discrete HMM is a stochastic process which is not observable but can only be observed through another stochastic process that generates a sequence of symbols. It is characterized by the state transition probabilities (A), the observation symbol probabilities (B), and the initial state probabilities (Π). We define the following model notation for discrete HMMs:

T: the length of the observation sequence,
N: the number of states,
M: the number of observation symbols,
Q = {q_t}: the set of states, q_t ∈ {1, 2, ..., N}, 1 ≤ t ≤ T,
V = {v_k}: the discrete set of possible observation symbols, 1 ≤ k ≤ M,
A = {a_ij}: the state transition probability distribution, where a_ij = P(q_{t+1} = j | q_t = i), 1 ≤ i, j ≤ N, with a_ij ≥ 0 for all i, j and

$$\sum_{j=1}^{N} a_{ij} = 1, \quad \forall i, \qquad (1)$$

B = {b_j(v_k)}: the observation symbol probability distribution, where

$$b_j(v_k) = P(v_k \text{ at } t \mid q_t = j), \quad 1 \le j \le N,\ 1 \le k \le M, \qquad (2)$$

Π = {π_i}: the initial state probability distribution, where

$$\pi_i = P(q_1 = i), \quad 1 \le i \le N, \quad \text{and} \quad \sum_{i=1}^{N} \pi_i = 1. \qquad (3)$$

The compact notation of each HMM is defined by λ = (A, B, Π). For a classification problem, the goal is to classify an unknown observation sequence O into one of C classes. If we denote the C models by λ_c, 1 ≤ c ≤ C, then this observation sequence is classified to class c*, where

$$c^{*} = \arg\max_{c \in \{1, \dots, C\}} P(O \mid \lambda_c).$$

[Fig. 1. (a) Examples of the 31 hand gestures and the number of keyframes of each hand gesture in the signature library: "a" 27, "a1" 27, "b" 19, "c" 18, "cb1" 16, "cb2" 14, "d" 51, "d1" 56, "e" 14, "ef1" 14, "ef2" 19, "ef3" 26, "f" 19, "g" 15, "g1" 41, "g2" 16, "hi1" 18, "hi2" 25, "i" 14, "j" 43, "j1" 41, "k" 14, "kl1" 22, "kl2" 15, "l" 16, "m" 15, "mn1" 12, "mn2" 8, "nh" 49, "nh1" 18, "nh2" 28. (b) Thai sign language translation structure: test video → decimation to select 14 frames in a frame sequence → SIFT keypoint descriptors of each frame → match keypoint descriptors → symbol for each frame in the sequence → HMMs → the result is the model with maximum probability. (c) Left–right HMMs used in the experiment.]
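A minimal sketch of this decision rule, under our own naming conventions, is given below: each word model is stored as log-probability matrices, an integer-coded observation sequence is scored with the forward algorithm in log space to avoid underflow, and the index of the best-scoring model is returned.

```python
import numpy as np

def log_forward(obs, log_A, log_B, log_pi):
    """Log-likelihood log P(O | lambda) of an integer observation sequence
    under a discrete HMM given as log-probability matrices."""
    alpha = log_pi + log_B[:, obs[0]]                  # initialization
    for o in obs[1:]:
        # alpha_t(j) = logsumexp_i(alpha_{t-1}(i) + log a_ij) + log b_j(o)
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[:, o]
    return np.logaddexp.reduce(alpha)

def classify(obs, models):
    """models: list of (log_A, log_B, log_pi) tuples, one per word.
    Return the index c* of the model with maximum log P(O | lambda_c)."""
    scores = [log_forward(obs, *m) for m in models]
    return int(np.argmax(scores))
```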

[Fig. 2. Keypoint descriptor generation: (a) keypoints found on a keyframe; (b) the description of one keypoint. After a keypoint is found in (a), the gradients are determined and a major orientation is found. The area around the keypoint is rotated by this major angle, and the gradients are added to the histogram bins that give the description of this keypoint. Keypoint descriptors found on hand gestures (c) "c", (d) "d", and (e) "f"; (f) hand gesture name "a" assigned to a test image using SIFT and test frames with constraint; (g) and (h) "j" and "g" assigned to test images using SIFT and test frames without constraint.]

In this work, the symbols of the HMMs are the hand gesture names assigned to all 14 frames in the frame sequence by the SIFT. Hence, M and T are equal to 31 and 14, respectively. For simplicity, we call a hand gesture name a symbol from now on. Each class is represented by the model trained for each assigned word. If we get an unknown frame sequence, we calculate P(O|λ_c) for all c and classify this frame sequence into the class which yields the maximum probability. The Baum–Welch method (Rabiner, 1989; Rabiner and Juang, 1993) is used for training the system to find the best model for each assigned word. The Viterbi algorithm (Rabiner, 1989; Rabiner and Juang, 1993) is applied to classify the unknown frame sequence.
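As a companion to the forward-algorithm sketch above, the following is a minimal log-space Viterbi scorer under the same illustrative conventions; ranking the ten word models by this score instead of the full likelihood gives a Viterbi-based classification. The names are ours, and details such as smoothing of unseen symbols are omitted.

```python
import numpy as np

def log_viterbi_score(obs, log_A, log_B, log_pi):
    """Log-probability of the most likely state path for an integer-coded
    observation sequence under a discrete HMM (log-space Viterbi)."""
    delta = log_pi + log_B[:, obs[0]]        # best path ending in each state at t = 0
    for o in obs[1:]:
        # delta_t(j) = max_i(delta_{t-1}(i) + log a_ij) + log b_j(o)
        delta = np.max(delta[:, None] + log_A, axis=0) + log_B[:, o]
    return float(np.max(delta))
```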

3. Experimental results

The training and test video data sets are recorded at different times of day for several days from 20 subjects. Subjects 1–5 are the same subjects who were recorded for the keyframe collection in the signature library. In the videos, the subjects are asked to wear a black shirt with long sleeves and stand in front of a dark background while they perform the Thai sign language. There are 10 words in the experiment, i.e., "elder", "grandfather", "grandmother", "gratitude", "female", "male", "glad", "thank you", "understand", and "miss". The hand part in each image is selected manually into a 190 × 190 image frame (keyframe).

The testing videos are recorded from subjects 1–20. We decimate 14 frames from each video. Each frame is matched to a representative symbol using SIFT as follows. In the experiment, the SIFT thresholds in the symbol-assignment process are varied over 0.65, 0.7, and 0.75. Since the keyframes for each symbol in the signature library may have different numbers of keypoints, we compute the average number of matched keypoints per keyframe (Ave_Match) of each symbol as

$$\text{Ave\_Match} = \frac{\text{No. of matched keypoints of the symbol}}{\text{No. of keyframes of the symbol}}. \qquad (4)$$

Then we pick the matched symbol by choosing the one that gives the maximum Ave_Match. Examples of the computed Ave_Match for the symbols "e" and "ef1" are shown in Fig. 3(a) and (b). The matched symbols in Fig. 3(a) and (b) are "e" and "ef1", with Ave_Match values of 11.43 and 4.36, respectively.
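A sketch of this per-frame symbol assignment, assuming the ratio-test matcher sketched earlier and a library that stores the descriptors of every keyframe grouped by symbol, is shown below; the dictionary layout and all names are illustrative.

```python
def assign_symbol(des_frame, library, ratio=0.7):
    """library: dict mapping symbol name -> list of descriptor arrays,
    one array per keyframe of that symbol. Returns the symbol with the
    highest average number of matched keypoints per keyframe (Eq. (4))."""
    best_symbol, best_score = None, -1.0
    for symbol, keyframe_descriptors in library.items():
        matched = sum(len(ratio_test_matches(des_frame, des_kf, ratio))
                      for des_kf in keyframe_descriptors)
        ave_match = matched / len(keyframe_descriptors)
        if ave_match > best_score:
            best_symbol, best_score = symbol, ave_match
    return best_symbol
```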

The numbers of words in the training data set of the HMMs and in the blind test data sets of all 20 subjects are shown in Table 1.

[Fig. 3. Ave_Match of the symbols for two test frames: (a) the matched symbol is "e" and (b) the matched symbol is "ef1".]

To make the experiment report easy to follow, we call the training data sets of subjects 1–15 data sets 1a–15a and the blind test data sets of subjects 1–15 data sets 1b–15b. The blind test data sets used to represent the signer-independent case are from subjects 16–20. We assign symbols to each frame in the training data set using the SIFT. Although there are 10 HMMs to classify the 10 words, the length of the observation sequence (T) in each model is the same, i.e., 14. All HMMs used in the experiments are left–right models, as shown in Fig. 1(c). The number of states (N) in each model is different: N for "elder", "grandfather", "grandmother", "gratitude", "male", "female", "glad", "thank you", "understand", and "miss" is 3, 4, 3, 5, 3, 4, 3, 4, 4, and 5, respectively.

Table 1. Number of words in the training data set from subjects 1–15 and in the test data set from subjects 1–20.

Subject | (elder) | (grandfather) | (grandmother) | (gratitude) | (male) | (female) | (glad) | (thank you) | (understand) | (miss)
Training data set: 1a | 36 | 36 | 36 | 36 | 36 | 12 | 36 | 36 | 36 | 36
Training data set: 2a–15a | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32
Test data set: 1b | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12
Test data set: 2b–15b | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 8
Test data set: 16–19 | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 20
Test data set: 20 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10

Table 2. Classification rate (%) of the validation set from 4-fold cross validation when training with data set 1a, with data sets 1a–5a, and with data sets 1a–15a.

Training set | SIFT threshold | 1st cross validation | 2nd cross validation | 3rd cross validation | 4th cross validation
Train with 1a | 0.65 | 97.62 | 97.62 | 97.62 | 94.05
Train with 1a | 0.7 | 97.62 | 96.43 | 97.62 | 91.67
Train with 1a | 0.75 | 92.86 | 92.86 | 91.67 | 91.67
Train with 1a–5a | 0.65 | 92.08 | 91.83 | 92.57 | 90.59
Train with 1a–5a | 0.7 | 93.32 | 91.09 | 91.58 | 92.57
Train with 1a–5a | 0.75 | 87.87 | 89.85 | 92.08 | 90.84
Train with 1a–15a | 0.65 | 83.72 | 83.97 | 84.88 | 83.06
Train with 1a–15a | 0.7 | 82.39 | 82.97 | 85.30 | 83.97
Train with 1a–15a | 0.75 | 74.92 | 82.23 | 82.23 | 80.98

[Table 3. Classification rate on signer-dependent, signer-semi-independent and signer-independent data from the models trained by 1a, trained by 1a–5a, and trained by 1a–15a, listed for each data set (1a/1b through 15a/15b and subjects 16–20) at the SIFT thresholds 0.65, 0.7, and 0.75.]

For example, the symbols appearing in the word "understand" are "m", "mn1", "mn2", and "nh". Comparing our left–right model to the fully connected left–right Bakis model (Rabiner, 1989; Rabiner and Juang, 1993), they are exactly the same in the 3-state model. However, in the 4- and 5-state models, there are 1 and 3 missing transitions, respectively. The missing transition in the 4-state model is that from state 1 to state 4, whereas the missing transitions in the 5-state model are those from state 1 to states 4 and 5, and from state 2 to state 5. The reason we select this model is that each sign will have at least three states to cope with the motion.
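One way to encode this topology is as a boolean mask over the transition matrix, allowing self-loops and forward jumps of at most two states; the helper below builds that mask for the 3-, 4-, and 5-state models and reproduces the missing transitions listed above (1→4 for N = 4; 1→4, 1→5, and 2→5 for N = 5). It is an illustrative sketch, not code from the paper.

```python
import numpy as np

def left_right_mask(n_states, max_jump=2):
    """Boolean mask of allowed transitions for a left-right HMM in which a
    state may stay or move forward by at most max_jump states."""
    allowed = np.zeros((n_states, n_states), dtype=bool)
    for i in range(n_states):
        for j in range(i, min(i + max_jump, n_states - 1) + 1):
            allowed[i, j] = True
    return allowed

# For N = 5 this forbids exactly the transitions 1->4, 1->5, and 2->5:
# print(left_right_mask(5).astype(int))
```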

In this work, we divide the experiment into three parts, i.e., training with 1a, training with 1a–5a, and training with 1a–15a. We implement 4-fold cross validation in all of the experiments. Table 2 shows the correct classification of the validation set when training with 1a. The HMMs from the 1st cross validation are selected because they give the best result on the validation set. Since there are many confusion matrices, we decide to show only the classification rates in all of the experiments. The classification results for data sets 1a and 1b (signer-dependent), 2a–5a (signer-semi-independent), 2b–5b (signer-semi-independent), 6a–15a (signer-independent), 6b–15b (signer-independent), and 16–20 (signer-independent) are shown in Table 3. From the table we can see that only the blind test result for the signer-dependent case is around 90%. The classification results of the blind test data sets for the signer-semi-independent case are approximately 50%. The blind test results for the signer-independent case are mostly below 50%, except those of subjects 7, 8, and 17, which are around 70%.

Then we implement 4-fold cross validation on the data sets 1a–5a. The classification rates of the validation set are also shown in Table 2. We can see that on average the 3rd cross validation implementation gives the best classification rate; hence, its HMMs are selected. From the classification rates also shown in Table 3, we can see that the average classification rates of the blind test data sets of signer-dependent subjects (subjects 1–5) are 90.68%, 90.23%, and 86.82% with the 0.65, 0.7, and 0.75 SIFT thresholds, respectively, whereas the average classification rates of the blind data sets of signer-independent subjects 6–15 are 56.88%, 59.88%, and 62.88%, and those of signer-independent subjects 16–20 are 62.56%, 66.33%, and 71.33% with the 0.65, 0.7, and 0.75 SIFT thresholds, respectively. When we look at the results carefully, we can see that the maximum classification rates of the blind data sets from subjects 7, 8, and 17 are 82.5%, 83.75%, and 86.5% with the 0.75 SIFT threshold. From these two experiments, we can see that if we increase the number of subjects in the HMM training, there is a chance that the classification rates of all types of signers will also increase.

Therefore, the 4-fold cross validation on the data sets 1a–15a is implemented. Again from Table 2, the HMMs trained in the 3rd cross validation give the best result on average. We utilize these models on all data sets, and the classification rates are also shown in Table 3. We can see that the average classification rates of the blind data sets of signer-dependent subjects (subjects 1–5) with the 0.65, 0.7, and 0.75 SIFT thresholds are 88.86%, 87.5%, and 86.59%, respectively, whereas those of signer-semi-independent subjects (subjects 6–15) are 78.25%, 79.75%, and 77.88%, respectively. The average classification rates of the blind data sets of signer-independent subjects (subjects 16–20) are 76.56%, 75.56%, and 75.56% with the 0.65, 0.7, and 0.75 SIFT thresholds, respectively. Again, the maximum classification rates of the blind data sets from subjects 7 and 8 (signer-semi-independent) are 98.75% and 92.5% with the 0.7 SIFT threshold, whereas that of subject 17 is 91.5% with the 0.65 SIFT threshold.

Although the average classification results of the signer-semi-independent and signer-independent cases are not quite as good as those of the signer-dependent case, the classification rates from some subjects in each category are above 90%. This may be because these subjects make no mistakes when performing the signs.

Table 4. Indirect comparison with other systems.

System | Instrument used | Pre-process with segmentation | Mode | No. of signers | Recognition rate
ArSL | None: free hands | Yes | Offline, signer-dependent | 18 | 97.4%
ArSL | None: free hands | Yes | Online, signer-independent | 18 | 90.6%
CSL | 18-sensor data gloves and two position trackers | N/A | Online, signer-independent | 6 | 80%
TFSL1 | None: free hands | No | Offline, signer-dependent | 2 | 60–80%
TFSL1 | None: free hands | No | Offline, signer-independent | 2 | 20–40%
TFSL2 | None: free hands | Yes | Offline, signer-dependent | 2 | 72%
Proposed method | None: free hands | No | Offline, signer-dependent | 5 | 86–95% (on average)
Proposed method | None: free hands | No | Offline, signer-semi-independent | 10 | 79.75% (on average)
Proposed method | None: free hands | No | Offline, signer-independent | 5 | 76.56% (on average)

For the sake of curiosity, we also apply the system trained with 1a–15a to subjects 17 and 19 (called 17b and 19b, respectively) with various complex natural backgrounds. They are asked to wear a short-sleeve shirt and stand in front of natural backgrounds while they perform each sign. Each sign is performed five times by each subject and is recorded in different places and at different times. The results are also shown in Table 3. We found that the best correct classification rates are 78% and 70% for subjects 17b and 19b, respectively, at the 0.65 SIFT threshold. With a complex background, the SIFT might match keypoints incorrectly, as shown in Fig. 2(h). Our algorithm is able to find the correct symbol using Eq. (4) even with some mismatched keypoints. However, if there are too many mismatched keypoints, the algorithm cannot find the correct symbol.

From the experiments, we found that misclassification might occur because some Thai sign language words have similar hand gestures, e.g., "woman" and "miss". When we compute O and P(O|λ_c), there is a chance that the model that gives the maximum value is not the right one. For the "miss" HMM, the observation sequence should be {"nh", "nh", "nh", "nh", "nh", "hi1", "hi1", "hi2", "hi2", "i", "i", "i", "i", "i"}. For the "woman" HMM, the observation sequence should be {"nh2", "nh2", "nh2", "nh2", "nh2", "nh2", "nh2", "nh2", "nh2", "nh2", "nh2", "nh2", "nh2", "nh2"} because the hand gesture of this word is almost unchanged throughout the sign. However, there is a chance that frames in "woman" are matched to the symbol "nh" instead of "nh2". The observation sequence of an example of a misclassified "woman" sign is {"nh", "nh", "nh", "nh2", "nh2", "nh2", "nh2", "nh2", "nh2", "nh2", "nh2", "nh", "nh", "nh2"}; hence, the HMMs give the maximum value from the "miss" model instead.

Another reason is that some hand gestures in the signature library are very similar. For example, the hand gestures for the symbols "g", "g2", and "k" shown in Fig. 1(a) are very similar; the difference between these symbols is the position of the hand (close to the body, away from the body, or straight). When subjects make sign language words that use these symbols, they sometimes tend to make similar hand gestures for these symbols.

To our knowledge, there are two research works involving Thai finger-spelling sign language translation systems (TFSL) (Sriboonruang et al., 2004; Phitakwinai et al., 2008). An indirect comparison can be made between these two methods and our translation system. In addition, an indirect comparison can be made with the translation system developed for Chinese Sign Language (CSL) (Gao et al., 2004) and the one developed for Arabic Sign Language (ArSL) (Al-Rouson et al., 2009). In Table 4, we show a general comparison between our system and the two systems for TFSL, namely TFSL1 (Phitakwinai et al., 2008) and TFSL2 (Sriboonruang et al., 2004), CSL (Gao et al., 2004), and ArSL (Al-Rouson et al., 2009). Our system yields a good result that is comparable with ArSL for the signer-dependent case. Although for the signer-independent case our system is not as good as ArSL, we do not need any pre-processing or segmentation before computing the translated words. It is also hard to compare the difficulty levels of different data sets because we have no knowledge of the hand sign similarity among the words in the data set that ArSL was applied to. However, ArSL applied the discrete cosine transform (DCT) to extract features from each frame. It is hard to verify how ArSL could extract information about hands and arms in the spatial domain. It is also not clear whether the information of the fingers was taken into account. That might be the reason it was mentioned in Al-Rouson et al. (2009) that the errors of their method were among the gestures "home", "ate", "wake up", and "sniff". These four gestures have similar positions of the arms and hands, but the fingers' shapes and positions are different. If the information of the fingers were taken into account, then the errors should be reduced. In contrast, our proposed method looks into the fine details of the hands, including the fingers. The bottom line is that it is extremely difficult to compare recognition performance achieved on different data sets. In hand sign language, it is even more difficult because there are several languages, and the hand gesture used in one language to represent a word can differ from that used in another language. The selection of the set of words is also an important issue: a set with similar hand gestures for different words will definitely be more difficult than one with distinct gestures.

4. Conclusions

In this work, we build an automatic Thai sign language translation system using the Scale Invariant Feature Transform (SIFT) and Hidden Markov Models (HMMs). In particular, we decimate a video sequence into 14 image frames. We then find a matched observation symbol from the keypoint descriptors in the signature library database, collected from five subjects at different times of day and over several days, using the SIFT algorithm. We found that the best result on the blind data sets of signer-dependent subjects is between 86% and 95% on average, and the average for signer-semi-independent subjects (same subjects used in the HMM training) is around 79.75%, whereas the best average classification rate on the blind data sets of signer-independent subjects is 76.56%. This system, moreover, does not need any segmentation technique before translating each video sequence into a word. The important feature of the proposed system is that it considers not only the position of the hands but also the shapes and positions of the fingers, which is not the case in the previously proposed methods. This allows the system to recognize hand sign words that have similar gestures, which might not be achieved using the existing methods.


However, it is more natural in applications not to constrain the clothing or background. Hence, we further apply the best system to two signers from the signer-independent set (subjects 17 and 19) who are asked to wear a short-sleeve shirt and stand in front of various complex backgrounds. The best correct classification rate in this case is around 74% on average. In our future work, we will test the system with more signs.

Acknowledgments

This research is supported by the Telecommunications Research and Industrial Development Institute under the contracts Nos. TRIDI RS. 2551/003 and TRIDI LAB 2551/001. We would like to thank the teachers and the students at Anusarnsunthorn School for the Deaf in Chiang Mai, Thailand for their help and advice on Thai sign language.

References

Alon, J., Athisos, V., Yuan, Q., Sclaroff, S., 2005. Simultaneous localization and recognition of dynamic hand gestures. In: Proc. IEEE Workshop on Motion and Video Computing, pp. 254–260.

Al-Rouson, M., Assaleh, K., Tala'a, A., 2009. Video-based signer-independent Arabic sign language recognition using hidden Markov models. Appl. Soft Comput. 9, 990–999.

Auephanwiriyakul, S., Chaisatian, P., 2004. Static hand gesture translation using string grammar hard C-means. In: The Fifth International Conference on Intelligent Technologies, Houston, Texas, USA.

Chen, L.Y., Mizuno, N., Fujimoto, H., Fujimoto, M., 2002. Hand shape recognition using the bending element of fingers in sign language. Trans. Japan Soc. Mech. Eng. Part C 68 (12), 3689–3696.

Chen, F.S., Fu, C.M., Huang, C.L., 2003. Hand gesture recognition using a real-time tracking method and hidden Markov models. Image Vision Comput. 21 (8), 745–758.

Fels, S.S., Hinton, G.E., 1993. Glove-talk: a neural network interface between a data-glove and a speech synthesizer. IEEE Trans. Neural Networks 4 (1), 2–8.

Gamage, N., Kuang, Y.C., Akmeliawati, R., Demidenko, S., 2011. Gaussian process dynamical models for hand gesture interpretation in sign language. Pattern Recognit. Lett. 32 (15), 2009–2014.

Gao, W., Chen, X.L., Ma, J.Y., Wang, Z.Q., 2000. Building language communication between deaf people and hearing society through multimodal human–computer interface. Chin. J. Comput. 23 (12), 1253–1260.

Gao, W., Fang, G., Zhao, D., Chen, Y., 2004. A Chinese sign language recognition system based on SOFM/SRN/HMM. Pattern Recognit. 37 (12), 2389–2402.

Han, J., Awad, G., Sutherland, A., 2009. Modelling and segmenting subunits for sign language recognition based on hand motion analysis. Pattern Recognit. Lett. 30 (6), 623–633.

Huang, C.L., Huang, W.Y., 1998. Sign language recognition using model-based tracking and a 3D Hopfield neural network. Mach. Vision Appl. 10 (5–6), 292–307.

Ibarguren, A., Maurtua, I., Sierra, B., 2012. Layered architecture for real time sign recognition: hand gesture and movement. Eng. Appl. Artif. Intell. 23, 1216–1228.

Jiang, H., Helal, A., Elmagarmid, A.K., Joshi, A., 1998. Scene change detection techniques for video database systems. Multimedia Syst. 6 (3), 186–195.

Just, A., Marcel, S., 2009. A comparative study of two state-of-the-art sequence processing techniques for hand gesture recognition. Comput. Vision Image Understanding 113 (4), 532–543.

Kelly, D., McDonald, J., Markham, C., 2012. A person independent system for recognition of hand postures used in sign language. Pattern Recognit. Lett. 31 (11), 1359–1368.

Kobayashi, T., Haruyama, S., 1997. Partly-hidden Markov model and its application to gesture recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing – Proceedings (ICASSP), vol. 4, pp. 3081–3084.

Kramer, J., Leifer, L., 1989. The talking glove: a speaking aid for nonvocal deaf and deaf-blind individuals. In: Proc. of RESNA 12th Annual Conference, pp. 471–472.

Lee, Y.H., Tsai, C.Y., 2009. Taiwan sign language (TSL) recognition based on 3D data and neural networks. Expert Syst. Appl. 36 (2, Part 1), 1123–1128.

Lowe, D.G., 2004. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60 (2), 91–110.

Min, B.W., Yoon, H.S., Soh, J., Yang, Y.M., Ejima, T., 1997. Hand gesture recognition using hidden Markov models. IEEE Int. Conf. Syst. Man Cybern. 5, 4232–4235.

Office of the Basic Education Commission, 1997. Thai Hand Sign Language Handbook under the Initiatives of Her Royal Highness Princess Maha Chakri Sirindhorn, Bangkok, Thailand (in Thai).

Oz, C., Leu, M.C., 2011. American sign language word recognition with a sensory glove using artificial neural networks. Eng. Appl. Artif. Intell. 24, 1204–1213.

Phitakwinai, S., Auephanwiriyakul, S., Theera-Umpon, N., 2008. Thai sign language translation using fuzzy C-means and scale invariant feature transform. Lecture Notes Comput. Sci. 5073, 1107–1119.

Prisacariu, V.A., Reid, I., 2012. 3D hand tracking for human computer interaction. Image Vision Comput. 30, 236–250.

Rabiner, L., 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77 (2), 257–286.

Rabiner, L., Juang, B.H., 1993. Fundamentals of Speech Recognition. Prentice Hall, Englewood Cliffs, NJ.

Shen, X., Hua, G., Williams, L., Wu, Y., 2012. Dynamic hand gesture recognition: an exemplar-based approach from motion divergence fields. Image Vision Comput. 30, 227–235.

Sriboonruang, Y., Kumhom, P., Chamnongthai, K., 2004. Hand posture classification using wavelet moment invariant. In: IEEE International Conference on Virtual Environments, Human–Computer Interfaces and Measurement Systems, pp. 78–82.

Stergiopoulou, E., Papamarkos, N., 2009. Hand gesture recognition using a neural network shape fitting technique. Eng. Appl. Artif. Intell. 22, 1141–1158.

Su, M.C., Jean, W.F., Chang, H.T., 1996. A static hand gesture recognition system using a composite neural network. In: Proceedings of the Fifth IEEE International Conference on Fuzzy Systems, pp. 786–792.

Wang, Q., Chen, X., Zhang, L.G., Wang, C., Gao, W., 2007. Viewpoint invariant sign language recognition. Comput. Vision Image Understanding 108 (1–2), 87–97.

Wu, J.Q., Gao, W., Song, Y., Liu, W., Pang, B., 1998. Simple sign language recognition system based on data glove. In: International Conference on Signal Processing Proceedings (ICSP), vol. 2, pp. 1257–1260.

Wu, J.Q., Gao, W., Chen, X.L., Ma, J.Y., 2000. Hierarchical DGMM recognizer for Chinese sign language recognition. J. Software 11 (11), 1430–1439.

Yang, R., Sarkar, S., 2009. Coupled grouping and matching for sign and gesture recognition. Comput. Vision Image Understanding 113 (6), 663–681.