
2011 IEEE International Conference on Technology for Education (T4E 2011), Chennai, India, July 14-16, 2011. DOI 10.1109/T4E.2011.48

Gesture Recognition for American Sign Language with Polygon Approximation

Geetha M.1, Rohit Menon2, Suranya Jayan, Raju James, Janardhan G.V.V.

Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham
[email protected]1, [email protected]2

Abstract— We propose a novel method to recognize the symbols of the American Sign Language alphabet (A-Z) that have static gestures. Many existing systems require special data acquisition devices such as data gloves, which are expensive and difficult to handle. Some methods, like fingertip detection, cannot recognize the alphabets that have closed fingers. We propose a method in which the boundary of the gesture image is approximated to a polygon with the Douglas-Peucker algorithm. Each edge of the polygon is assigned a Freeman chain code direction, and the fingertip count together with the difference chain code sequence forms the feature vector. Matching first looks for a perfect match; if none exists, substring matching is performed. The method efficiently recognizes both open and closed finger gestures.

Keywords- ASL, chain code, gesture recognition, polygon approximation

I. INTRODUCTION

Sign language is a widely used and accepted standard for communication by people with hearing and speaking impairments. It utilizes hand and body gestures for communication.

We call a symbol represented by a static gesture a static symbol; all symbols except J and Z are static symbols. We call the number of open fingers in a gesture the finger count. Our proposed system recognizes static alphabets in ASL and translates them into textual output; this text can further be converted into speech. The system provides an opportunity for individuals with hearing and speech impairments to communicate and learn using computers. We apply digital image processing to the static gesture image of a sign language symbol, and the feature vectors extracted from the image are used to classify it as one of the ASL symbols.

Many existing systems require the person gesticulating to use special data acquisition devices like data gloves [1], which are expensive and difficult to handle. Our paper proposes a more flexible vision-based approach in which the person is free from additional equipment.

Our approach makes use of fingertip detection and polygon approximation for sign language translation. The Canny edge detection algorithm is used to detect the edges present in the image. We then trace the boundary of the resulting image to find the fingertips, which gives the finger count; this lets us recognize gestures having open fingers. The edge-detected image is also subjected to polygon approximation, which represents the hand gesture as a polygon. A difference chain code sequence is generated for the polygon by assigning a Freeman chain code to each of the polygon edges. The finger count and chain code sequence together form the feature vector, which is given to the neural network based classifier.

II. BACKGROUND

There have been several papers that attempt this problem for various sign languages [10-15]. Specifically, [2] used fingertip detection, applying an edge detection algorithm (the Canny edge operator) and boundary tracing. Hand gestures have been recognized automatically from data such as the shape and kinematics of compressed arm trajectories [2]. The hand can be detected using attributes like its motion and skin color [3]. Hand shape has been estimated under complex backgrounds by augmenting models with the position and velocity of the hand [4]. The image of the hand gesture can be captured and converted into feature vectors [5]. Hand gesture input has been taken with a data glove, with artificial neural networks used to recognize the gesture [6]. Hand gestures have also been represented in terms of hierarchies of multi-scale color images [7]. Some systems combine more than one feature extraction method with neural networks to recognize hand gestures [8].

III. PROPOSED METHOD

Our approach is robust and efficient for static symbol recognition and translation of ASL. The proposed method has two main phases for feature extraction. The first phase obtains a finger count, similar to [5]. The second phase extracts the shape of the gesture by polygon approximation. The first phase helps in identifying gestures that have open fingers. Since some gestures do not have any open fingers, the second phase is needed to improve the feature set.

First we convert the static RGB image into a grayscale image. Then the Canny edge detection algorithm is applied to this grayscale image to detect the edges. The resulting image is clipped to retain only the palm. Boundary tracing is then performed and fingertips are detected to obtain the finger count. Next we perform polygon approximation on the edge-detected image, which produces a polygon whose edges are assigned chain codes. The chain codes and the finger count together are taken as the feature vector.



Fig. 1. Overall scheme of the proposed method

A. Edge Detection & Clipping

1) Canny Edge Detection

Edge detection is the process of marking the boundaries of an object where there is strong variation in the pixel intensities of the image.

The Canny edge detector is widely regarded as an optimal edge detector. It is designed to satisfy a set of criteria: a low error rate, well-localized edge points (minimum distance between the actual and detected edges), and a single response to a single edge. Detection is based on the first derivative of a Gaussian filter. The two-dimensional Gaussian function is

G(x, y) = k \exp\left(-\frac{x^{2} + y^{2}}{2\sigma^{2}}\right)    (1)

and its gradient vector is

\nabla G = \left(\frac{\partial G}{\partial x}, \frac{\partial G}{\partial y}\right) = \left(-\frac{x}{\sigma^{2}}\,G,\; -\frac{y}{\sigma^{2}}\,G\right)    (2)

where k is a constant and σ is the parameter of the Gaussian filter that controls the extent of smoothing.
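To make equations (1) and (2) concrete, here is a minimal NumPy sketch that builds the Gaussian kernel and its two partial derivatives. The function name and the 3σ truncation radius are illustrative choices of ours, not taken from the paper.

```python
import numpy as np

def gaussian_derivative_kernels(sigma, k=1.0, radius=None):
    """Build a 2-D Gaussian kernel G and its first derivatives.

    G(x, y) = k * exp(-(x^2 + y^2) / (2 * sigma^2))        -- Eq. (1)
    dG/dx   = -(x / sigma^2) * G, symmetrically for y      -- Eq. (2)
    """
    if radius is None:
        radius = int(3 * sigma)          # cover ~99.7% of the mass (assumed cutoff)
    ax = np.arange(-radius, radius + 1)
    x, y = np.meshgrid(ax, ax)
    g = k * np.exp(-(x**2 + y**2) / (2 * sigma**2))
    gx = -(x / sigma**2) * g             # partial derivative w.r.t. x
    gy = -(y / sigma**2) * g             # partial derivative w.r.t. y
    return g, gx, gy
```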

The first step in the Canny algorithm is to filter out any noise in the original image; this is done with a Gaussian filter implemented as a simple convolution mask. After smoothing, the next step is to find the edge strength using a gradient operator, in this case the Sobel operator, which uses two 3×3 convolution masks to estimate the gradients in the x and y directions.

Fig. 2: Kernels for the Sobel operator

The edge strength is |G| = \sqrt{G_x^{2} + G_y^{2}} and the edge direction is \theta = \arctan(G_y / G_x).

Fig 3: Canny edge detected binary image

Once the edge direction is known, it is resolved into one of the four directions (0°, 45°, 90° and 135°) that can be traced in the image. The next step, non-maximum suppression, traces along the edge and sets any pixel not on the edge to 0, giving a thin line as output. Finally, hysteresis thresholding with a high threshold h and a low threshold l is performed: any pixel with a value greater than h is selected as an edge, and any pixel connected to a selected pixel with a value greater than l is also selected as an edge.
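As a practical illustration (not the authors' implementation), the whole chain of grayscale conversion, smoothing, Sobel gradients, and Canny detection can be sketched with OpenCV. The file name hand.png and the thresholds 50/150 are assumptions of ours.

```python
import cv2
import numpy as np

img = cv2.imread("hand.png")                      # hypothetical input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)      # RGB -> grayscale
blur = cv2.GaussianBlur(gray, (5, 5), 1.4)        # Gaussian noise filtering

# Edge strength and direction from the Sobel gradients (cf. Fig. 2).
gx = cv2.Sobel(blur, cv2.CV_64F, 1, 0, ksize=3)
gy = cv2.Sobel(blur, cv2.CV_64F, 0, 1, ksize=3)
strength = np.hypot(gx, gy)                       # |G| = sqrt(Gx^2 + Gy^2)
theta = np.degrees(np.arctan2(gy, gx))            # edge direction

# cv2.Canny performs non-maximum suppression and hysteresis internally;
# the two arguments are the low (l) and high (h) thresholds.
edges = cv2.Canny(blur, 50, 150)
```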

2) Clipping

Clipping [2] is done to eliminate the regions of the edge detected image that are not needed for identifying the gesture.

Fig 4: Image after clipping

[Fig. 1 flowchart (recovered labels): Input Image → Hand detection → Edge detection → Clipping → Boundary Tracing → Finger Tip detection and Finger Tip Counting in one branch, Polygon Approximation and Chain Code Calculation (Chain Coding) in the other → Feature Vector → Matching.]



This is done by finding the row (y value) from which the clipping should start. Y1 is the row at which five or more consecutive white pixels are first detected when scanning upward from the very last row. To find Y2, the edges representing the wrist portion of the hand are assumed to be of roughly equal length; the first row where there is a drastic change in the run length marks the end of the wrist and the beginning of the palm, and is noted as Y2. The image is clipped from the maximum of Y1 and Y2 for better results.
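A hedged sketch of this row-scanning heuristic, assuming a binary NumPy edge image; the "change by more than half" test used for Y2 is our own stand-in for the paper's unquantified "drastic change".

```python
import numpy as np

def find_clip_row(edges):
    """Locate Y1 and Y2 as described above (edges: binary 2-D array).

    Y1: first row, scanning upward from the last row, containing five
        or more consecutive white pixels.
    Y2: first row whose white-pixel run length changes drastically from
        the row below it (assumed here: by more than half).
    """
    def longest_run(row):
        run = best = 0
        for v in row:
            run = run + 1 if v else 0
            best = max(best, run)
        return best

    y1 = y2 = None
    prev = None
    for y in range(edges.shape[0] - 1, -1, -1):      # bottom row upward
        best = longest_run(edges[y] > 0)
        if y1 is None and best >= 5:
            y1 = y
        if y2 is None and prev not in (None, 0) and abs(best - prev) > prev // 2:
            y2 = y
        prev = best
    return max(y for y in (y1, y2) if y is not None)

# palm = edges[:find_clip_row(edges)]   # discard the wrist rows below the cut
```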

B. Boundary Tracing and Fingertip Detection

As defined earlier, the finger count is the number of open fingers in a gesture. We obtain the finger count by performing boundary tracing.

Only the finger region of the hand is of interest; the palm region can be discarded. Thus a minimum y value must be found from which boundary tracing can start [2]. The optimal starting y is found within a region R of the edge-detected palm image, where R spans 30%-35% of the image height from the bottom.

Starting from the optimal y level, the first white pixel is detected and an upward trace is performed from that pixel, keeping the x value constant and varying only y. If the pixel is not on the boundary, the pixels to its left and right are checked for white intensities and a left/right flag is set accordingly. A fingertip is detected when the y value cannot move further upward and must move downward to reach the next white pixel. Boundary tracing is done using the following algorithm.

1. Search the image from the top left until a pixel P0 of a new region is found. This pixel is the starting border element: it has the minimum column value among all pixels of the region that have the minimum row value. Define a variable dir that stores the direction of the previous move along the border, from the previous border element to the current one. Initially assign:

a) dir = 3 if the border is detected in 4-connectivity;

b) dir = 7 if the border is detected in 8-connectivity.

2. Search the 3×3 neighborhood of the current pixel in an anti-clockwise direction, beginning the neighborhood search at the pixel positioned in the direction:

a) (dir + 3) mod 4 in 4-connectivity;

b) (dir + 7) mod 8 if dir is even, (dir + 6) mod 8 if dir is odd, in 8-connectivity.

The first pixel found with the same value as the current pixel is the new border element Pn. Update the dir value.

3. Stop when the current border element Pn equals the second border element P1 and the previous border element Pn-1 equals the starting element P0. Otherwise repeat step 2.

4. The detected border is represented by the pixels P0 ... Pn-2.
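A minimal Python sketch of this border-following procedure for the 8-connectivity case, extended with the fingertip test described above (a fingertip as a point where the trace stops moving upward and turns back down). The function names, the NumPy representation of the binary image, and the isolated-pixel guard are illustrative choices, not the paper's code.

```python
import numpy as np

# Freeman directions 0..7 (0 = east, anti-clockwise), as (row, col) offsets.
OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
           (0, -1), (1, -1), (1, 0), (1, 1)]

def trace_border(img):
    """Steps 1-4 above for the 8-connectivity case (img: binary array)."""
    ys, xs = np.nonzero(img)
    if len(ys) == 0:
        return []
    start = min(zip(ys, xs))         # Step 1: top-left-most white pixel P0
    border, cur, d = [start], start, 7
    while True:
        # Step 2: anti-clockwise neighbourhood search, starting at
        # (d + 7) mod 8 if d is even, (d + 6) mod 8 if d is odd.
        d = (d + 7) % 8 if d % 2 == 0 else (d + 6) % 8
        for k in range(8):
            nd = (d + k) % 8
            ny, nx = cur[0] + OFFSETS[nd][0], cur[1] + OFFSETS[nd][1]
            if 0 <= ny < img.shape[0] and 0 <= nx < img.shape[1] and img[ny, nx]:
                cur, d = (ny, nx), nd
                border.append(cur)
                break
        else:
            return border            # isolated pixel: no white neighbours
        # Step 3: stop once we re-enter P0 -> P1 the way we started.
        if len(border) > 2 and border[-1] == border[1] and border[-2] == border[0]:
            return border[:-2]       # Step 4: the border is P0 ... Pn-2

def count_fingertips(border):
    """Fingertips as points where the trace stops rising and turns down
    (row index decreases, then increases); a simplification of the text."""
    return sum(1 for a, b, c in zip(border, border[1:], border[2:])
               if b[0] < a[0] and c[0] > b[0])
```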

Fig 5: Finger tip detected

C. Polygon Approximation

As discussed before, the implementation works as a combination of the fingertip detection strategy and the chain code matching strategy for gesture translation. The edge-detected image contains a large number of vertex points defining the edges of the hand gesture, and it would be inefficient to perform chain code estimation on every one of them. Hence we perform polygon approximation, whereby the edge-detected image is converted into an approximate polygonal shape that carries the shape information. The advantage of this method is that the number of vertices provided as input to the chain code estimation module is drastically reduced, and the complexity reduces with it. Though the approximation incurs a small loss, most of the shape information of the image is preserved.

The method we use here for polygon approximation is the Douglas-Peucker algorithm, which reduces the number of points in a curve that is approximated by a series of points. The expected time complexity of the algorithm is O(N log N), with a worst case of O(N²).

The method given in Douglas and Peucker [9] is best described recursively. To approximate the chain from Vi to Vj, start with the line segment ViVj. If the farthest vertex from this segment has distance at most ε, then accept this approximation. Otherwise, split the chain at this vertex and recursively approximate the two pieces.

Given an array of vertices V, the call DP(V; i; j) simplifies the subchain from Vi to Vj.

Algorithm DP(V; i; j)

1. Find the vertex Vf farthest from the line ViVj. Let dist be its distance.

2. if dist > ε then

3a. DP(V; i; f)   // split at Vf and approximate

3b. DP(V; f; j)   // recursive call

else

4. Output(ViVj)   // use ViVj in the approximation
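A straightforward Python rendering of this recursion, under the assumption that the chain is given as a list of (x, y) vertices; the helper name and distance formula are ours.

```python
def douglas_peucker(V, eps):
    """Recursive DP(V; i; j) from the pseudocode above.

    V is a list of (x, y) vertices; eps is the tolerance epsilon.
    """
    def dist_to_chord(p, a, b):
        # Perpendicular distance from p to the line through a and b.
        (ax, ay), (bx, by), (px, py) = a, b, p
        dx, dy = bx - ax, by - ay
        seg = (dx * dx + dy * dy) ** 0.5
        if seg == 0.0:
            return ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
        return abs(dx * (py - ay) - dy * (px - ax)) / seg

    if len(V) < 3:
        return list(V)
    # Step 1: find the vertex Vf farthest from the chord ViVj.
    dists = [dist_to_chord(p, V[0], V[-1]) for p in V[1:-1]]
    f = max(range(len(dists)), key=dists.__getitem__) + 1
    if dists[f - 1] > eps:                        # Step 2
        left = douglas_peucker(V[:f + 1], eps)    # 3a: DP(V; i; f)
        right = douglas_peucker(V[f:], eps)       # 3b: DP(V; f; j)
        return left[:-1] + right                  # Vf is shared by both halves
    return [V[0], V[-1]]                          # Step 4: keep the chord ViVj
```

In practice, OpenCV ships the same reduction as cv2.approxPolyDP, which can be applied directly to a traced contour.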



Fig 6: Curve approximated to polygon edges

D. Chain Coding

Chain coding is a method of shape representation for polygons. Chain codes represent the boundary of an object, composed of regular pixel cells, by a connected sequence of straight-line segments of specified length and direction. The object is traversed in a clockwise manner, and as the boundary is traversed, the direction of each chain segment is specified using the following numbering scheme:

Fig. 7: Chain code directions

One advantage of the chain code is that it is translation invariant; it also preserves the information of interest and permits compact storage. The Freeman chain code captures, along the boundary of an object, the change of direction from one edge to the next, i.e., where a corner is formed. To make the sequence rotation invariant, the difference chain code is taken as the feature vector.

The approximated polygon image of the gesture is given as input to the chain coding module, which computes a chain code for each of the polygon edges. The method starts by scanning the image to find the first white pixel of the object. From that pixel, we traverse the approximated polygon boundary, determine the direction of each edge, and save the codes as an array or list, repeating until we return to the starting edge. The output of the module is the chain code sequence representing the shape of the gesture. The differences of consecutive chain codes are then calculated, and the resulting sequence is used as the feature vector.
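The following sketch computes a Freeman code per polygon edge and then the difference sequence. The (x, y) vertex representation, the rounding of edge angles to the nearest 45°, and the assumption that the numbering follows the usual Freeman convention (0 = east, anti-clockwise) are ours; Fig. 7 defines the paper's actual scheme.

```python
import math

def freeman_code(p, q):
    """Chain code of the polygon edge from vertex p to vertex q.

    Vertices are (x, y) in image convention (y grows downward), so dy
    is negated to recover the usual mathematical orientation.
    """
    angle = math.degrees(math.atan2(p[1] - q[1], q[0] - p[0])) % 360
    return int(round(angle / 45.0)) % 8       # nearest of the 8 directions

def difference_chain_code(polygon):
    """Per-edge chain codes of a closed polygon (list of vertices),
    then first differences modulo 8 -- the rotation-invariant feature."""
    codes = [freeman_code(p, q)
             for p, q in zip(polygon, polygon[1:] + polygon[:1])]
    return [(b - a) % 8 for a, b in zip(codes, codes[1:] + codes[:1])]
```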

E. Matching

The feature vector given as input to the classifier consists of the finger count and the difference chain code sequence obtained by traversing the edges of the approximated polygon in the clockwise direction. Matching first looks for a perfect match; if there is no perfect match, substring matching is performed in which not more than one error may occur.
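A minimal sketch of this two-pass matching, under two assumptions of ours: templates are stored as (finger count, difference code sequence) pairs, and "not more than one error" is read as at most one mismatched position between equal-length sequences.

```python
def match(feature, templates):
    """Return the recognized letter, or None if no template matches.

    feature:   (finger_count, difference_code_sequence) of the input gesture
    templates: dict mapping letter -> (finger_count, code_sequence);
               this layout is an illustrative choice, not the paper's.
    """
    count, codes = feature
    # Pass 1: perfect match on both finger count and code sequence.
    for letter, (t_count, t_codes) in templates.items():
        if t_count == count and t_codes == codes:
            return letter
    # Pass 2: same finger count, at most one differing code position.
    for letter, (t_count, t_codes) in templates.items():
        if t_count == count and len(t_codes) == len(codes):
            if sum(a != b for a, b in zip(codes, t_codes)) <= 1:
                return letter
    return None  # unrecognized gesture
```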

IV. EXPERIMENTAL RESULTS

The proposed method recognized gestures with open fingers as well as closed fingers. The finger count was the basis for recognizing gestures with open fingers, while the chain codes were crucial for detecting closed finger gestures. Out of the 24 static hand gestures (A-Z excluding J and Z), our system successfully recognizes 22 (all except A and M). A and M were not recognized properly because the exterior boundary shapes of these two gestures are nearly identical; by increasing the precision of the polygon approximation, these two gestures could be distinguished. The alphabets J and Z could not be recognized by this method since they involve motion, which requires motion tracking.

Fig 8: Gestures for ASL alphabets

Fig 9: Screen shot for polygon approximation

S. No   Alphabet   Recognition Rate (%)
  1     A            0
  2     B          100
  3     C           90
  4     D           85
  5     E          100
  6     F          100
  7     G           90
  8     H           90
  9     I          100
 10     K           85
 11     L          100
 12     M            0
 13     N           83
 14     O           85
 15     P          100
 16     Q          100
 17     R          100
 18     S           85
 19     T           85
 20     U          100
 21     V          100
 22     W          100
 23     X          100
 24     Y          100

Fig 10: Performance evaluation

V. CONCLUSIONS AND FUTURE WORK

Our method recognizes all the static symbols of the American Sign Language alphabet except A and M. Future research can focus on recognizing the non-static symbols, i.e., J and Z, which will involve capturing the 2D trajectory of the dynamic gesture. Capturing and recognizing gestures corresponding to words is also an interesting open problem, and the work can be extended to cover other aspects of ASL.

REFERENCES

[1] Jeonghun Ku, Richard Marz, Nicole Baker, Konstantine K. Zakzanis, Jang Han Lee, In Y. Kim, Sun I. Kim, Simon J. Graham, "A data glove with tactile feedback for virtual reality experiments," CyberPsychology & Behavior, 2003.

[2] Ravikiran J., Kavi Mahesh, Suhas Mahishi, Dheeraj R., Sudheender S., Nitin V. Pujari, "Finger detection for sign language recognition," Proc. International MultiConference of Engineers and Computer Scientists, Hong Kong, March 18, 2009.

[3] Sylvie Gibet and Pierre-François Marteau, "Approximation of curvature and velocity for gesture segmentation and synthesis," Université de Bretagne Sud, France, 2009.

[4] David Rybach, J. Borchers, H. Ney, "Appearance-based features for automatic continuous sign language recognition."

[5] Yasushi Hamada, Nobutaka Shimada and Yoshiaki Shirai, "Hand shape estimation for complex background images," Osaka University, Japan, 2004.

[6] Manglai Zhou, "3D model based hand gesture recognition and tracking," PAMI Lab, University of Waterloo, December 3, 2009.

[7] Lars Bretzner, Ivan Laptev, Tony Lindeberg, "Hand gesture recognition using multi-scale colour features, hierarchical models and particle filtering," CVAP Laboratory, Department of Numerical Analysis and Computer Science, Sweden, 2002.

[8] Vaishali S. Kulkarni and S. D. Lokhande, "Appearance based recognition of American Sign Language using gesture segmentation," Sinhgad College of Engineering, 2010.

[9] David H. Douglas and Thomas K. Peucker, "Algorithms for the reduction of the number of points required to represent a digitized line or its caricature," Cartographica: The International Journal for Geographic Information and Geovisualization, 1973.

[10] J. Lee and T. Kunii, "Model-based analysis of hand posture," IEEE Computer Graphics and Applications, vol. 15, no. 5, pp. 77-86, September 1995.

[11] H. Fillbrandt, S. Akyol, K.-F. Kraiss, "Extraction of 3D hand shape and posture from image sequences for sign language recognition," IEEE International Workshop on Analysis and Modeling of Faces and Gestures, pp. 181-186, October 2003.

[12] T. Starner, A. Pentland, J. Weaver, "Real-time American Sign Language recognition using desk and wearable computer based video," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 12, pp. 1371-1375, December 1998.

[13] C.-H. Teh and R. T. Chin, "On image analysis by the methods of moments," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 10, no. 4, pp. 496-513, July 1988.

[14] Y. Shirai, N. Tanibata, N. Shimada, "Extraction of hand features for recognition of sign language words," VI'2002, Computer-Controlled Mechanical Systems, Graduate School of Engineering, Osaka University, 2002.

[15] Y. Hamada, N. Shimada and Y. Shirai, "Hand shape estimation under complex backgrounds for sign language recognition," in Proc. 6th Int. Conf. on Automatic Face and Gesture Recognition, pp. 589-594, May 2004.
