178 IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, VOL. 8, NO. 1, FEBRUARY 2012

3-D Posture and Gesture Recognition for Interactivity in Smart Spaces

Cuong Tran, Student Member, IEEE, and Mohan Manubhai Trivedi, Fellow, IEEE

Abstract—Automatic perception of human posture and gesture from vision input has an important role in developing intelligent video systems. In this paper, we present a novel gesture recognition approach for human computer interactivity based on marker-less upper body pose tracking in 3-D with multiple cameras. To achieve the robustness and real-time performance required for practical applications, the idea is to break the exponentially large search problem of upper body pose into two steps: first, the 3-D movements of the upper body extremities (i.e., head and hands) are tracked. Then, using knowledge of upper body model constraints, these extremity movements are used to infer the whole 3-D upper body motion as an inverse kinematics problem. Since the head and hand regions are typically well defined and undergo less occlusion, tracking them is more reliable and enables more robust upper body pose determination. Moreover, by breaking the problem of upper body pose tracking into two steps, the complexity is reduced considerably. Using the pose tracking output, gesture recognition is then performed based on a longest common subsequence similarity measurement of upper body joint angle dynamics. In our experiment, we provide an extensive validation of the proposed upper body pose tracking from 3-D extremity movement, which showed good results with various subjects in different environments. Regarding the gesture recognition based on joint angle dynamics, our experimental evaluation of five subjects performing six upper body gestures with average classification accuracies over 90% indicates the promise and feasibility of the proposed system.

Index Terms—Head and hands tracking, human activity analysis, inverse kinematics, smart environment, upper body motion tracking.

I. INTRODUCTION AND MOTIVATION

Human posture and activity analysis has emerged as an important, interdisciplinary area with many potential applications such as surveillance [31], advanced human computer interaction [32], intelligent driver assistance systems [7], [29], [30], 3-D animation, health care monitoring, and robot control [9], [15]. Compared to previous technologies using wearable sensors or markers [9], [15], marker-less vision-based approaches provide more natural and less intrusive solutions, which are more convenient for real-world deployments. This is, however, a challenging task due to the exponentially large space of possible body poses, the issue of self occlusion, as well as variance in human appearance (e.g., different clothes and hair) and lighting conditions.

In this paper, we develop a system for human machine interactivity that can recognize human gestures from marker-less multiview input. We choose to deal with the upper body only since it is simpler than the full body, less detailed than facial or hand gestures, and yet conveys important information about several human activities in which the arms carry the most influential information of upper body motion (e.g., in meeting room, teleconference, and driver assistance situations).

Manuscript received March 29, 2011; revised June 22, 2011; accepted September 14, 2011. Date of publication October 19, 2011; date of current version January 20, 2012. This work was supported in part by the University of California Discovery Program and in part by the National Science Foundation. Paper no. TII-11-167.

The authors are with the Computer Vision and Robotics Research Laboratory, University of California, San Diego, CA 92093-0434 USA (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TII.2011.2172450

The novelty of our approach is to develop the first system (as far as we know) that does both 3-D upper body pose inference from extremity tracking in real time and gesture recognition based on the pose tracking outputs (i.e., joint angle dynamics). We propose a computational approach for upper body tracking using the 3-D movement of extremities (head and hands) from multiview input, called the extremity movement observation (XMOB) upper body tracker (a demonstration of XMOB was shown in [28]). XMOB solves the body pose estimation problem in two parts. The 3-D movements of head and hands are first tracked using multiview input. Then, using upper body model constraints, the full upper body motion is inferred from just the extremity movements as an inverse kinematics problem. The advantages are: first, the complexity is reduced considerably by breaking the exponentially large search problem into two subproblems. Second, the self occlusion issue is alleviated because the extremities are the easiest parts to track and are rarely occluded even with only two views. Furthermore, XMOB will work as long as the head and hands are observable; it does not matter if the user wears loose clothes or the clothes colors blend with the background, which could be a difficult case for other approaches.

Using the XMOB upper body pose tracking output, gesture recognition is done based on longest common subsequence (LCS) similarity measurement. Our experimental results for gesture recognition showed a good classification rate (over 90% on average) for six common upper body gestures, indicating the advantage and feasibility of developing a gesture recognition system based on pose tracking. We have also applied this system to develop an interactive game (see Fig. 1) in which the subject can use some common gestures to interact with the balloons (a supplemental video clip is available1).

The remaining sections are organized as follows. In Section II, related research studies are reviewed. In Section III, the proposed system, with the XMOB upper body tracker followed by a method for gesture recognition based on joint angle dynamics, is described in detail. In Section IV, experimental evaluations are provided, and finally, we offer some concluding remarks in Section V.

1 http://cvrr.ucsd.edu/ctran/Supplements/TII-SpecialIssue-Supplement.avi

1551-3203/$26.00 © 2011 IEEE

Fig. 1. Interactive game in which the user uses some upper body gestures like pointing, punching, and clapping to select, pop, and release balloons.

II. REVIEW OF RELATED RESEARCH STUDIES

A. Human Body Gesture Recognition

Human gesture recognition from vision input is challenging due to human variations in doing the same gesture (intraclass), e.g., differences in appearance, viewpoint, and execution, as well as the overlap between gesture classes (interclass). Reviews of recent research in activity analysis can be found in [12] and [21]. Two main components of an action recognition system are choosing an action representation (feature) space and then action classification. It is very important to select a good representation space, which should generalize over variations within each gesture class but still be rich enough to distinguish between different classes. Some examples of action representation spaces include space-time shape, motion history volume [21], cylindrical voxel histogram [13], and distance transform of body contours [34]. Since extremal parts (i.e., head and hands in the case of the upper body) can be extracted more reliably and with less occlusion compared to other inner parts, there are also several methods using features based on extremity dynamics, e.g., analyzing hand trajectories and posture [16], or using a variable star skeleton representation [35]. The proposed system is also based on extremity (head and hands) movements for gesture recognition. However, instead of using raw head and hand trajectories, we incorporate knowledge of the underlying upper body model, which could help improve gesture recognition. An intuitive and clear way to do so is to implement upper body pose estimation and tracking. The outputs of body pose tracking, such as joint angle dynamics, are mentioned as rich, view-invariant representations for gesture recognition but are challenging to derive [21].

B. Human Body Pose Estimation and Tracking

Regarding human pose estimation and tracking, we can loosely categorize related research studies into monocular [6], [19] and multiview approaches [2]–[4], [20], [26], [36]. Compared to a monocular view, multiview data can help to reduce the self occlusion issue and provide more information to make the pose estimation task easier as well as to improve the accuracy. Since estimating the real body pose in 3-D is desirable, using voxel data reconstructed from multiview input can help to avoid the repeated projection of the 3-D body model onto the image planes for comparison, as well as the image scale issue. These advantages allow the design of simpler algorithms using human knowledge about the shapes and sizes of body parts [20]. In this paper, we also follow a model-based approach using two-view video input and aim to extract real 3-D posture. As concluded in [24], although human tracking is considered mostly solved in constrained situations, i.e., a large number of calibrated cameras, people wearing tight clothes, and a static environment, there are still remaining key challenges, including tracking with fewer cameras, dealing with complex environments and variations in object appearance (e.g., general clothes, hair, etc.), adapting to different body shapes with automatic initialization, and automatically recovering from failure. Based on this, Table I shows a summary comparing, to some extent, the proposed XMOB system with selected representative model-based methods for human body pose estimation using multiview data.

Due to the exponentially large search problem of human body pose estimation and tracking, a common task is figuring out a way to search "smartly" for the optimal pose given the image evidence. Several approaches, e.g., [2], [17], [19], assume some prior models of motion and/or appearance to reduce the search space. The performance of those approaches, however, is limited to the type of motion and appearance in their training data. Mikic et al. [20] use specific information about the shape and size of the head and torso in a hierarchical growing procedure for body model acquisition and tracking. Bernier et al. [1] use a graphical model to decompose the full 3-D pose state space into individual limb state spaces. In our XMOB system, we also try to reduce the complexity by breaking the large search problem of upper body tracking into two steps. The motivation came from research studies in psychophysiology [14] which showed the human ability to recognize gesture and activity by only observing the movement of some "light points" attached to the body. Later on, Soechting and Flanders, researchers in neurophysiology, studied the inverse kinematics of arms and found that the desired position of the hand roughly determines the arm posture [25]. They developed the sensorimotor transformation model (STM), which is a set of linear functions, to compute angle parameters of arm pose from known end points. Koga et al. [22] used this STM to implement an inverse kinematics algorithm for computer animation of human arms. Nevertheless, since human arm kinematics is redundant, the STM only provides one among many available solutions. To overcome this ambiguity, XMOB exploits "temporal inverse kinematics," using observation of extremity dynamics for 3-D upper body pose tracking instead of inverse kinematics at just a single frame. To some extent, XMOB tries to address some remaining key challenges identified in a recent summary of state-of-the-art methods for pose and motion estimation [24], including dealing with general loose clothing, recovering from failure, and using only two cameras.

III. FRAMEWORK OF THE 3-D POSTURE–GESTURE RECOGNITION SYSTEM

The flowchart of the proposed system is shown in Fig. 2. From the two-view input, XMOB first tracks head and hand blobs in 3-D based on a robust semisupervised skin color segmentation. Then, following a numerical approach, the geometrical constraints of the upper body model at each frame are used to determine a set of hypotheses for possible inner joint locations (shoulders and elbows) from the current head and hand positions. By observing head and hand movements over a period of time, XMOB selects the upper body pose sequence that minimizes the total joint displacement. Although minimizing the total joint displacement is a heuristic assumption, our experimental results with various subjects in different environments indicate its feasibility.

TABLE I
COMPARATIVE SUMMARY OF SELECTED MODEL-BASED METHODS FOR MULTIVIEW BODY POSE ESTIMATION AND TRACKING

Based on the output of XMOB upper body pose tracking, gesture classification is done based on LCS similarity measurement of joint angle dynamics. It should be mentioned that continuous body movement may contain both gesture movements and nongesture movements. Therefore, the task of gesture spotting (extracting a gesture segment, which will be classified, from a continuous movement sequence) is also important and challenging. In this paper, we have not actually dealt with the gesture spotting issue yet. In our experiment for gesture recognition, the subject is required to perform only predetermined gestures separated by a stop period. Therefore, we can simply segment the gesture part based on the motion versus nonmotion cue, as sketched below.
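As an illustration of this motion versus nonmotion cue, the following is a minimal sketch of segmenting gestures from a stream of tracked extremity positions. The window length and speed threshold are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def segment_gestures(positions, speed_thresh=0.05, min_len=5):
    """Split a sequence of 3-D extremity positions into gesture segments.

    positions: (T, K, 3) array of head/hand positions per frame.
    speed_thresh: average extremity speed below which the subject is
                  considered to be in a "stop" period (assumed value).
    Returns a list of (start, end) frame index pairs.
    """
    # Per-frame average speed of all tracked extremities.
    speed = np.linalg.norm(np.diff(positions, axis=0), axis=2).mean(axis=1)
    moving = speed > speed_thresh

    segments, start = [], None
    for t, m in enumerate(moving):
        if m and start is None:
            start = t
        elif not m and start is not None:
            if t - start >= min_len:
                segments.append((start, t))
            start = None
    if start is not None and len(moving) - start >= min_len:
        segments.append((start, len(moving)))
    return segments
```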

XMOB uses a skeletal model for the upper body, as shown in Fig. 2. The lengths of body parts, including the shoulder line and neck line, are considered fixed, which means there are only kinematic movements at the joints: two shoulder joints, each with 3 degrees of freedom (DOF), and two 1-DOF elbow joints. There are also physical constraints on the shoulder and elbow joints which limit the possible range of each joint angle within a degree of freedom. An upper body configuration can be represented by a set of upper body joint and end point positions {P_hea, P_lha, P_rha, P_lsh, P_rsh, P_leb, P_reb}, which can be split into inner joints {P_lsh, P_rsh, P_leb, P_reb} and extremal parts {P_hea, P_lha, P_rha} (where hea is the head, lha is the left hand, rha is the right hand, lsh is the left shoulder, rsh is the right shoulder, leb is the left elbow, and reb is the right elbow). We discuss the framework components in more detail in the following subsections.


Fig. 2. Framework for upper body gesture recognition based on 3-D pose tracking. As shown, the framework uses a skeletal model of the upper body.

A. Head and Hands Tracking With a Semisupervised Procedure for Robust Skin Color Segmentation

Using skin color as a cue to track head and hand blobs is quite straightforward. However, in many cases, including our experimental data, merely using a general skin color model, e.g., [10], is not robust enough for head and hands tracking. In principle, we achieve more robustness and less complexity when we focus on the specific case of a particular user's skin color against a particular background rather than a general clustering model for an arbitrary user's skin color against an arbitrary background. This can be done manually if, at the beginning of each session, a person manually selects positive samples of the user's skin color and negative samples of background colors for that session. Our idea is to automate this process with some help from interaction with the user. We design a simple semisupervised procedure in which the user is asked to start by trying to move only their extremities (head and hands). Combining the detected motion areas with a general skin color model, we obtain a more specific and robust skin color segmentation model (a sketch is given below). From the head and hand segmentation results, 3-D voxel data of head and hand blobs is reconstructed using the shape-from-silhouette method [4]. Then, the 3-D head and hand blobs are tracked with the mean shift algorithm.
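The following is a minimal sketch of this semisupervised bootstrapping step. It assumes frame differencing for motion detection and a single Gaussian model in RGB for the session-specific skin color; these modeling choices are illustrative assumptions, not the exact components used in the paper.

```python
import numpy as np

def build_session_skin_model(frames, general_skin_mask_fn, motion_thresh=20):
    """Collect skin samples from moving regions while the user moves only
    head and hands, then fit a session-specific Gaussian color model.

    frames: list of HxWx3 uint8 images (user moving only extremities).
    general_skin_mask_fn: function(image) -> HxW bool mask from a generic
                          skin color model (e.g., [10]); assumed given.
    """
    samples = []
    prev = frames[0].astype(np.int16)
    for frame in frames[1:]:
        cur = frame.astype(np.int16)
        # Motion mask from simple frame differencing.
        motion = np.abs(cur - prev).max(axis=2) > motion_thresh
        prev = cur
        # Keep only pixels that are both moving and generically skin-like.
        candidate = motion & general_skin_mask_fn(frame)
        samples.append(frame[candidate].reshape(-1, 3))
    samples = np.concatenate(samples, axis=0).astype(np.float64)
    mean = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False) + 1e-6 * np.eye(3)
    return mean, np.linalg.inv(cov)

def skin_mask(frame, mean, cov_inv, max_dist=3.0):
    """Segment skin pixels with the session-specific Gaussian model
    (Mahalanobis distance threshold)."""
    diff = frame.reshape(-1, 3).astype(np.float64) - mean
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    return (d2 < max_dist ** 2).reshape(frame.shape[:2])
```

In the actual system, the resulting blobs would then feed the shape-from-silhouette reconstruction and the mean shift tracker described above.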

For automatic initialization of the upper body model (determining the fixed lengths of the shoulder line, neck line, upper arm, and lower arm), XMOB also asks the user to start with straight arms, facing forward. Therefore, the arm length can be determined and then used to scale up an average body model [18] for initialization (see the sketch below).
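As a simple illustration of this initialization, the sketch below scales an average model by the ratio of the measured arm length to the average arm length. The listed average segment lengths are placeholder values, not those of [18].

```python
# Hypothetical average upper body segment lengths in meters (placeholders,
# not the values from [18]).
AVERAGE_MODEL = {"shoulder_line": 0.38, "neck_line": 0.12,
                 "upper_arm": 0.30, "lower_arm": 0.27}

def init_body_model(measured_arm_length):
    """Scale the average model so its total arm length (upper + lower arm)
    matches the arm length measured in the straight-arm start pose."""
    avg_arm = AVERAGE_MODEL["upper_arm"] + AVERAGE_MODEL["lower_arm"]
    scale = measured_arm_length / avg_arm
    return {part: length * scale for part, length in AVERAGE_MODEL.items()}
```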

B. Numerical Method to Predict Inner Joint Sequence From Extremity Movements

Knowing the estimates of head and hand positions, we want to estimate the remaining inner joint positions as an inverse kinematics problem. Since the upper body kinematics is redundant, XMOB deals with possible ambiguities by performing inverse kinematics over a motion segment instead of a single frame and adding the secondary goal of minimizing inner joint displacement during that motion segment. The problem is restated as follows. Given a motion segment of extremal points (head and hands), the goal is to find the corresponding inner joint sequence that satisfies the joint constraints C, with the optimization target of minimizing the total inner joint displacement. The assumptions used by XMOB can be summarized as follows.

1) Most of the influential information of upper body motion is carried by the arms.

2) The human body is symmetric between left and right. During a period of time, this left–right balance tends to be preserved (though it might not hold at a single frame).

3) Humans tend to optimize their movement so that the joint displacement is minimized.

Although these assumptions are heuristic, our extensive experimental validation on different users in indoor and in-vehicle environments supports their feasibility.

1) Shoulder Joint Position Prediction: Since the shoulder joints typically move with a much lower frequency than the hands and elbow joints, XMOB updates the shoulder position less frequently. For a given temporal segment, XMOB only predicts a single position for the shoulder joints. Since the human body shows a bilateral symmetry between left and right, this left–right "balance" tends to be preserved during a period of time. Note that this assumption does not mean a strict symmetry between the left and right hands. As shown in Fig. 3, this left–right "balance" assumption can be interpreted as follows. If we compute the centroid C_lha of the left-hand trajectory and the centroid C_rha of the right-hand trajectory during a temporal segment, they should be symmetric (over the neck point N) when projected onto the shoulder line. This also means that the shoulder line should be perpendicular to the line from the neck point N to the midpoint C_M of the two hand centroids. From the centroid of the head trajectory, N can be computed since the neck line has fixed length and is vertical. Denote a point in 3-D as a column vector of three coordinates and a shoulder joint position as P_sh; we have the following.

1) The shoulder length L_sh is known from the initialization (see Section III-A):

||P_sh - N||^2 = (L_sh / 2)^2.    (1)

2) The shoulder line is perpendicular to the (vertical) neck line, with v the vertical unit vector:

(P_sh - N)^T v = 0.    (2)

3) The shoulder line is perpendicular to the line N C_M (left–right "balance" assumption over a period of time):

(P_sh - N)^T (C_M - N) = 0    (3)

where C_M can be computed as the midpoint of the centroids C_lha, C_rha of the hand trajectories projected onto the same horizontal plane as the shoulder line. Using (1)–(3), we can solve for the 3-D coordinates of the shoulder joints. Equation (1) is quadratic, so normally we obtain two solutions, corresponding to the left and right shoulder joints, respectively. The left and right solutions are selected so that the left shoulder is on the side of the left hand and the right shoulder is on the side of the right hand. To avoid ill-conditioned situations, we do not update the shoulder joint positions when C_M is the same as or close to N, since the symmetry constraint (3) becomes too sensitive to small changes in position. A closed-form sketch of this computation is given below.

Fig. 3. Update of the shoulder line orientation for a whole temporal segment using the constraint of preserving body left–right "balance."
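Under the constraints (1)–(3) as reconstructed above, the shoulder offset from the neck point must be perpendicular to both the vertical axis and the line N C_M, so it lies along their cross product. The following is a minimal sketch of that computation; since the exact equation forms above are reconstructions, this code is illustrative rather than the paper's implementation.

```python
import numpy as np

def shoulder_candidates(neck, c_m, shoulder_len,
                        vertical=np.array([0.0, 1.0, 0.0])):
    """Solve constraints (1)-(3) for the two shoulder joint positions.

    neck: 3-D neck point N (from the head centroid and the fixed neck length).
    c_m: midpoint of the left/right hand trajectory centroids, projected to
         the shoulder-line height.
    shoulder_len: full shoulder line length L_sh known from initialization.
    Returns the two candidate shoulder positions (one per side).
    """
    direction = np.cross(vertical, c_m - neck)   # perpendicular to both (2) and (3)
    norm = np.linalg.norm(direction)
    if norm < 1e-6:
        raise ValueError("C_M too close to N: shoulder update skipped")
    direction /= norm
    offset = 0.5 * shoulder_len * direction      # constraint (1)
    return neck + offset, neck - offset

def assign_left_right(candidates, left_hand, right_hand):
    """Pick as left shoulder the candidate on the side of the left hand."""
    a, b = candidates
    if (np.linalg.norm(a - left_hand) + np.linalg.norm(b - right_hand)
            <= np.linalg.norm(b - left_hand) + np.linalg.norm(a - right_hand)):
        return a, b
    return b, a
```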

2) Elbow Joint Sequence Prediction: Since the lengths of the upper arm and lower arm are fixed, the possible elbow joint positions for a known shoulder joint position S and hand position H lie on a circle, as shown in Fig. 4(a). Furthermore, we observe that in a natural and comfortable position, the elbow joint lies roughly on the lower outside (pointing away from the body) quarter of that circle (the bold part of the circle). This can be considered a geometric interpretation of the physical constraints on the shoulder and elbow joints, which limit the possible range of each joint angle, turned into an approximate geometrical constraint.

Fig. 4(a) shows a special case in which the line SH from the shoulder joint to the hand position coincides with the y axis. Here, we assume a coordinate system attached to the shoulder joint with the x axis being the horizontal direction of the shoulder line, the y axis being the vertical direction (e.g., the neck line), and the z axis being the direction facing forward. Determining and quantizing the arc can be done as follows. Knowing the positions of S and H and the lengths of the upper and lower arm, we can compute the center O and the radius r of the circle of possible elbow positions.

1) The equations for points P_eb = (x_eb, y_eb, z_eb)^T on the mentioned 3-D circle are

x_eb^2 + z_eb^2 = r^2    (4)

y_eb = y_O    (5)

where y_O is the height of the circle center O.

2) The outside lower quarter of the circle is determined by sign constraints on x_eb and z_eb (one combination for the left elbow and the mirrored combination in x_eb for the right elbow).

The quantizing process is done by sampling the x_eb or z_eb coordinate in the aforementioned range. In the general case, when the hand and shoulder are in arbitrary positions, the above set of equations for a 3-D circle and the quantizing process become more complex to determine and solve. We deal with this by first finding the rotation matrix that rotates the line SH from shoulder joint to hand onto the y axis, bringing us back to the special case in Fig. 4(a). After computing candidate elbow positions in this special case, we use the inverse rotation matrix to transform these candidates back to actual elbow candidates (see the sketch below).

Fig. 4. Elbow joint prediction. (a) Generation of elbow candidates at each frame. (b) Over a temporal segment, selection of the sequence of elbow joints that minimizes the joint displacement. By adding two pseudo nodes s and t with zero-weighted edges, this can be represented as a shortest path problem.

The selection of the elbow candidate sequence (over a temporal segment) that minimizes the total joint displacement can be represented as a shortest path problem [see Fig. 4(b)]. Due to the layered structure of the graph in this case, the shortest paths from the source node s to the nodes in a layer are known once we reach that layer (they do not depend on subsequent layers). Therefore, we can implement a dynamic programming approach to solve this shortest path problem with time complexity linear in the number of frames N of the temporal segment (a sketch follows).
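The following is a minimal dynamic programming sketch of this layered shortest path (Viterbi-style) selection; variable names are illustrative.

```python
import numpy as np

def select_elbow_sequence(candidate_layers):
    """Pick one elbow candidate per frame minimizing total joint displacement.

    candidate_layers: list of (k, 3) arrays, one layer of 3-D candidates
                      per frame of the temporal segment.
    Returns the list of selected candidate indices, one per frame.
    """
    n = len(candidate_layers)
    cost = np.zeros(len(candidate_layers[0]))   # zero-weight edges from pseudo source s
    back = []                                   # backpointers per transition
    for t in range(1, n):
        prev, cur = candidate_layers[t - 1], candidate_layers[t]
        # Pairwise displacement between consecutive layers (k_prev x k_cur).
        disp = np.linalg.norm(prev[:, None, :] - cur[None, :, :], axis=2)
        total = cost[:, None] + disp
        back.append(np.argmin(total, axis=0))   # best predecessor per node
        cost = np.min(total, axis=0)
    # Backtrack from the cheapest node in the last layer
    # (zero-weight edges to the pseudo sink t).
    path = [int(np.argmin(cost))]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return list(reversed(path))
```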

C. LCS Similarity Measurement

Using the skeletal model for the upper body (see Section III), we have eight joint angles in total (three for each shoulder joint and one for each elbow joint). The joint angle dynamics in a temporal segment can then be extracted from the upper body pose tracking output. To measure the similarity between joint angle sequences, we chose the LCS measurement, which has been used for trajectory similarity measurement


and has been shown to be more robust to noise than Euclidean and dynamic time warping distances [33]. Consider two sequences A = (a_1, ..., a_n) and B = (b_1, ..., b_m). The modified LCS similarity is defined recursively as

LCS(A, B) = 0, if n = 0 or m = 0;
LCS(A, B) = 1 + LCS(Head(A), Head(B)), if |a_n - b_m| < epsilon, |n - m| <= delta, and the joint angle is changing (|a_n - a_{n-1}| > gamma or |b_m - b_{m-1}| > gamma);
LCS(A, B) = max(LCS(Head(A), B), LCS(A, Head(B))), otherwise    (6)

where Head(A) is the remaining sequence of A after removing the last element, and the control thresholds are epsilon (determines whether elements in A and B are matched or not), gamma (measures similarity only when the joint angles are changing), and delta (tolerates some time shift in matching the two sequences). Dynamic programming is used to avoid the massive recursive computations.

Here, we applied the LCS algorithm in [33] to compute the similarity between joint angle sequences, with a small change to measure the similarity only when the joint angles are changing [as shown in (6)]. This helps avoid the case in which the similarity of unchanged joint angles (not useful for gesture recognition) dominates the similarity of the acting joint angles. A dynamic programming sketch is given below.
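A minimal dynamic programming sketch of this modified LCS follows, using the threshold roles described above. Since the recursion in (6) was reconstructed from the surrounding text, the details (e.g., exactly how the "changing" test enters) are assumptions.

```python
import numpy as np

def lcs_similarity(a, b, eps=0.1, gamma=0.02, delta=10):
    """Modified LCS between two joint angle sequences a and b (1-D arrays).

    eps:   match threshold on the joint angle difference.
    gamma: minimum per-frame change for an element to count as "moving".
    delta: allowed time shift between matched elements.
    Returns the LCS length normalized by the shorter sequence length.
    """
    n, m = len(a), len(b)
    da = np.abs(np.diff(a, prepend=a[0]))   # per-frame change in a
    db = np.abs(np.diff(b, prepend=b[0]))   # per-frame change in b
    L = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = (abs(a[i - 1] - b[j - 1]) < eps
                     and abs(i - j) <= delta
                     and (da[i - 1] > gamma or db[j - 1] > gamma))
            if match:
                L[i, j] = 1 + L[i - 1, j - 1]
            else:
                L[i, j] = max(L[i - 1, j], L[i, j - 1])
    return L[n, m] / max(min(n, m), 1)
```

Applied to each of the eight joint angle sequences, this yields an eight-dimensional similarity vector, which is the quantity used in the classification step of the next subsection.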

D. Nearest Neighbor Clustering Based on LCS Measurements for Gesture Classification

Denote T_g = {Seq_1, ..., Seq_K} as the training set for gesture g, where K is the number of training samples. We compute the similarity between each pair Seq_i, Seq_j in T_g:

Sim_ij = LCS(Seq_i, Seq_j),  i, j = 1, ..., K.    (7)

Considering Sim_ij as a multivariate vector of similarities, one per joint angle, we compute the mean and covariance

mu_g = mean({Sim_ij}),  Sigma_g = cov({Sim_ij}).    (8)

The pair (mu_g, Sigma_g) characterizes gesture g; e.g., right arm (RA) punching sequences have a pattern with high similarity in the right shoulder and right elbow angles. The centroid sample Seq_c for gesture g is chosen from T_g so that it minimizes the total Mahalanobis distance of its similarity vectors to the cluster statistics:

Seq_c = argmin_{Seq_i in T_g} sum_j D_M(Sim_ij, mu_g, Sigma_g)    (9)

where D_M is the Mahalanobis distance

D_M(x, mu, Sigma) = sqrt((x - mu)^T Sigma^{-1} (x - mu)).    (10)

Given a test sample Seq_t, the distance from Seq_t to a gesture cluster g is computed as

D(Seq_t, g) = D_M(LCS(Seq_t, Seq_c), mu_g, Sigma_g) + ||LCS(Seq_t, Seq_c) - LCS(Seq_t, Seq_t)||    (11)

Fig. 5. Plot of the x-, y-, and z-coordinates of the estimated elbow position (solid black lines) compared to the ground truth (dotted blue lines) from the motion capture system. We see that some movement patterns of the elbows are captured in the estimates.

TABLE II
SPATIAL TRACKING ERRORS COMPARED WITH THE GROUND TRUTH

in which the first term measures how close the test sample Seq_t is to gesture cluster g, while the second term takes into account how well the cluster characterizes the joint angle dynamics in Seq_t. For example, the first term may yield a high similarity score if part of the joint angle dynamics in the test gesture has a motion pattern similar to cluster g. However, if the test gesture also has joint angle dynamics which do not appear in cluster g, the second term helps indicate that cluster g does not fully characterize the test gesture. We assume a "cluster" with mean LCS(Seq_t, Seq_t) and identity covariance matrix, so the second term roughly becomes the Euclidean distance from LCS(Seq_t, Seq_c) to LCS(Seq_t, Seq_t). The sum is used to choose the closest gesture cluster, as sketched below.
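A minimal sketch of this nearest neighbor classification step follows, built on the per-joint LCS similarity vectors described above. The training statistics (mean, inverse covariance, centroid sample) are assumed to have been computed per gesture as in (7)–(10), which are themselves reconstructed equations.

```python
import numpy as np

def mahalanobis(x, mean, cov_inv):
    """Mahalanobis distance of a similarity vector x from a cluster."""
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

def classify_gesture(test_angles, clusters, lcs_vector):
    """Assign a test sample to the closest gesture cluster.

    test_angles: list of eight joint angle sequences for the test sample.
    clusters: dict gesture -> dict with keys 'mean', 'cov_inv', and
              'centroid_angles' (the centroid training sample Seq_c).
    lcs_vector: function(seq_a, seq_b) -> 8-D array of per-joint LCS
                similarities (e.g., built from lcs_similarity above).
    """
    self_sim = lcs_vector(test_angles, test_angles)          # LCS(Seq_t, Seq_t)
    best_gesture, best_dist = None, np.inf
    for gesture, c in clusters.items():
        sim = lcs_vector(test_angles, c['centroid_angles'])  # LCS(Seq_t, Seq_c)
        # First term: closeness to the cluster statistics.
        d1 = mahalanobis(sim, c['mean'], c['cov_inv'])
        # Second term: how fully the cluster covers the test dynamics.
        d2 = float(np.linalg.norm(sim - self_sim))
        if d1 + d2 < best_dist:
            best_gesture, best_dist = gesture, d1 + d2
    return best_gesture, best_dist
```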

IV. EXPERIMENTAL VALIDATION

A. Upper Body Pose Tracking Results

Our setup has two color cameras configured with a wide baseline to observe the 3-D movement of head and hands. We captured several data sequences with different users against different indoor backgrounds to evaluate XMOB upper body tracking. XMOB has also been used for upper body pose tracking in a vehicle environment [27]. In order to have a quantitative evaluation for some indoor sequences, we also use a marker-based motion capture system to obtain a baseline ground truth of upper body motion simultaneously with the video data used as XMOB input. The system runtime is about 15 frames/s on an Intel Core i7 3.0 GHz.

Table II shows the tracking errors of the head, left hand, right hand, left elbow, and right elbow compared to the ground truth on a 2-min sequence in which the subject performed several typical gestures including clapping, waving, pushing, pointing, and some gesticulation with one hand. Note that the ground truths in this experiment are also not the exact joint positions; e.g., to avoid occlusion of markers on some skin regions, we put markers on the wrist to detect the hand position. Therefore, the mean error might be quite large due to this base error. Because the real error can be an addition to or subtraction from this base error, we cannot simply subtract the base error from the measured error. However, the variance of the measured error gives us a sense of the order of the real error. Fig. 5 directly shows the estimated 3-D positions of the elbows compared to the ground truth, which indicates that these estimates capture the movement patterns of the joints. Since there are variations in performing the same gesture (e.g., in a clapping gesture, people can clap in different directions and at different heights), the movement patterns seem to provide more influential information than the exact joint positions for gesture analysis.

Fig. 6. Visual results of 3-D upper body pose tracking shown with color lines. White lines are ground truth poses. White clouds are voxel data.

Fig. 7. Superimposed 3-D pose tracking results on images for visual evaluation. (Rows 1 and 2) Sample results on a meeting room scene with different test subjects. (Row 3) Sample results on a driving scene.

Some results for visual evaluation are shown in Fig. 6 (the visual result of 3-D upper body pose tracking compared to the ground truth). Fig. 7 shows a visual evaluation of the pose tracking results superimposed on the input images with different subjects in both indoor and in-vehicle environments.

1) Comparison With the Kinematically Constrained Gaussian Mixture Model (KC-GMM) Method on the Boxing and Waving Sequences From the Public HumanEva-I Dataset: We apply XMOB on two color views of the boxing and waving sequences in the HumanEva-I dataset. For an indicative comparison, we also apply the KC-GMM method [3] on all three color views of these sequences. Fig. 8 and Table III show the qualitative and quantitative comparison of spatial tracking accuracy (with marker-based ground truths from HumanEva-I) and runtime. Since the KC-GMM method uses voxel data of the full body, it is more sensitive to noise in the voxel data due to image segmentation quality and the limited number of camera views. In fact, the KC-GMM method loses track of the arms shortly after the manual initialization of the start pose, which resulted in high tracking errors for the elbows. On the other hand, XMOB only needs the voxel data of the head and hands, so it works reasonably well with only two views. We did observe some failure cases of XMOB: when the hands are too close, there is a misassignment between the left and right hand (crossed arms versus normal arms). However, this misassignment was recovered when the hands moved apart. In Fig. 8, the second image of the XMOB results on the boxing sequence shows a situation in which the left–right balance assumption used for estimating the shoulder line (3) does not hold. In such cases, we have larger errors in the shoulder and elbow joint estimates. However, it is important that XMOB can still operate and recovers when these difficulties disappear. Regarding the runtime, although we need to take into account that KC-GMM ran in MATLAB (on the same machine), the speed difference is still considerable. Note that the KC-GMM runtime also depends on the number of iterations needed to converge (e.g., larger movements typically require longer runtimes).

B. Gesture Recognition Results

The experiment is done on a set of four one-arm gestures: left arm (LA) punching, RA punching, LA waving, and RA waving, and two two-arm gestures: clapping and dumbbell curls (see Fig. 9). There are five subjects with different skin tones and heights. Each subject performs each gesture ten times. Two thirds of the samples are chosen randomly for training and the remaining one third is used for testing. This testing with random selection of training and testing sets is repeated five times. The average values over the different runs are then computed. Table IV shows the confusion matrix (average values over five runs) of the six gestures for all five subjects.

Fig. 8. Sample results on (left) the boxing and (right) waving sequences from the public HumanEva-I dataset. (Top row) Superimposed results from XMOB (using two views). (Bottom row) Results from the KC-GMM method using three views.

Fig. 9. Six gestures for recognition: LA punching, RA punching, LA waving, RA waving, dumbbell, and clapping.

TABLE III
SPATIAL TRACKING ERRORS ON HUMANEVA-I BOXING AND WAVING SEQUENCES OF XMOB AND KC-GMM METHODS

TABLE IV
ACCURACY CONFUSION MATRIX OF SIX GESTURES FOR ALL FIVE SUBJECTS (AVERAGE VALUE OVER FIVE RUNS). EACH ROW SUMS UP TO 100%

V. CONCLUDING REMARKS AND FUTURE WORK

We have introduced a novel 3-D posture and gesture recognition algorithm for real-time interactivity, with a detailed implementation. To achieve real-time performance and robustness, the exponentially large search problem of upper body pose tracking is broken into two subproblems: first, the 3-D movements of head and hands are tracked based on a robust semisupervised skin segmentation procedure. Then, the whole upper body motion is predicted from these extremity movements as an inverse kinematics problem. The underlying idea is to use what we can track more reliably and with less occlusion (i.e., the extremities) to support the estimation of the more difficult inner parts. Since we only need to observe extremity movements, the proposed system also works even if the user wears loose clothes or the clothes color blends with the background. Although there are possible ambiguities in the mentioned inverse kinematics problem, we have dealt with them using a numerical method based on the temporal dynamics of extremity movements (not just positions at a single frame) and some heuristic assumptions, including the minimization of total joint displacement. Our extensive experimental evaluation with various subjects in different environments indicated the feasibility of applying this approach in several realistic interactive applications.

Using the output of 3-D posture tracking for gesture recognition, we have developed a nearest neighbor clustering algorithm based on the LCS similarity measure of joint angle dynamics. Our initial experiment showed a good classification rate (over 90% on average) for six gestures performed by five subjects. This implies the value of having joint angle dynamics information for gesture recognition. Developing more sophisticated approaches, exploiting the marker-less pose tracking output, as well as combining body posture with other features (e.g., audio-visual combination [23]) for activity analysis, are promising directions to pursue. Currently, our system has not actually dealt with the gesture spotting issue, so that is obviously another direction for future work.

ACKNOWLEDGMENT

The authors would like to thank their colleagues at the Computer Vision and Robotics Research Laboratory for their valuable advice and assistance. C. Tran would like to thank the Vietnam Education Foundation for its sponsorship.

REFERENCES

[1] O. Bernier, P. Cheung-Mon-Chan, and A. Bouguet, "Fast nonparametric belief propagation for real-time stereo articulated body tracking," Comput. Vis. Image Understand., vol. 113, no. 1, pp. 29–47, 2009.

[2] F. Caillette, A. Galata, and T. Howard, "Real-time 3-D human body tracking using learnt models of behaviour," Comput. Vis. Image Understand., vol. 109, pp. 112–125, 2008.

[3] S. Y. Cheng and M. M. Trivedi, "Articulated human body pose inference from voxel data using a kinematically constrained Gaussian mixture model," in Proc. 2nd Workshop Eval. Articulated Human Motion Pose Estimat., 2007.

[4] G. Cheung, S. Baker, and T. Kanade, "Shape-from-silhouette of articulated objects and its use for human body kinematics estimation and motion capture," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2003, pp. I-77–I-84.

[5] S. Corazza, L. Mundermann, E. Gambaretto, G. Ferrigno, and T. Andriacchi, "Markerless motion capture through visual hull, articulated ICP and subject specific model generation," Int. J. Comput. Vis., vol. 87, no. 1–2, pp. 156–169, 2010.

[6] A. Datta, Y. Sheikh, and T. Kanade, "Linear motion estimation for systems of articulated planes," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2008, pp. 1–8.

[7] A. Doshi, B. Morris, and M. M. Trivedi, "On-road prediction of driver's intent with multimodal sensory cues," IEEE Pervasive Comput., vol. 10, no. 3, pp. 22–34, Mar. 2011.

[8] J. Gall, B. Rosenhahn, T. Brox, and H. Seidel, "Optimization and filtering for human motion capture: A multi-layer framework," Int. J. Comput. Vis., vol. 87, no. 1–2, pp. 75–92, 2010.

[9] H. Ghasemzadeh and R. Jafari, "Physical movement monitoring using body sensor networks: A phonological approach to construct spatial decision trees," IEEE Trans. Ind. Informat., vol. 7, no. 1, pp. 66–77, Feb. 2011.

[10] G. Gomez and E. Morales, "Automatic feature construction and a simple rule induction algorithm for skin detection," in Proc. ICML Workshop Mach. Learn. Comput. Vis., 2002, pp. 31–38.

[11] M. Hofmann and D. M. Gavrila, "Multi-view 3D human pose estimation in complex environment," Int. J. Comput. Vis., pp. 1–22, 2011, doi:10.1007/s11263-011-0451-1.

[12] M. B. Holte, C. Tran, M. M. Trivedi, and T. B. Moeslund, "Human activity recognition using multiple views: A comparative perspective on recent developments," in Proc. Workshop Multimedia Access 3D Human Objects, ACM Int. Conf. Multimedia, Dec. 2011.

[13] K. S. Huang and M. M. Trivedi, "3D shape context based gesture analysis integrated with tracking using omni video array," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshop, 2005.

[14] G. Johansson, "Visual perception of biological motion and a model for its analysis," Percept. Psychophys., vol. 14, no. 2, pp. 201–211, 1973.

[15] J. Kofman, X. Wu, T. J. Luu, and S. Verma, "Teleoperation of a robot manipulator using a vision-based human–robot interface," IEEE Trans. Ind. Electron., vol. 52, no. 5, pp. 1206–1219, Oct. 2005.

[16] H. Lee and J. H. Kim, "An HMM-based threshold model approach for gesture recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no. 10, pp. 961–973, Oct. 1999.

[17] R. Li, T. P. Tian, S. Sclaroff, and M. H. Yang, "3D human motion tracking with a coordinated mixture of factor analyzers," Int. J. Comput. Vis., vol. 87, no. 1–2, pp. 170–190, 2010.

[18] M. C. D. Mendonca, "Estimation of height from the length of long bones in a Portuguese adult population," Amer. J. Phys. Anthropol., vol. 112, no. 1, pp. 39–48, 2000.

[19] A. S. Micilotta, E. Ong, and R. Bowden, "Real-time upper body detection and 3D pose estimation in monoscopic images," in Proc. Eur. Conf. Comput. Vis., 2006, pp. 139–150.

[20] I. Mikic, M. M. Trivedi, E. Hunter, and P. Cosman, "Human body model acquisition and tracking using voxel data," Int. J. Comput. Vis., vol. 53, no. 3, pp. 199–223, 2003.

[21] R. Poppe, "Vision-based human motion analysis: An overview," Comput. Vis. Image Understand., vol. 108, pp. 4–18, 2007.

[22] Y. Koga, K. Kondo, J. Kuffner, and J. Latombe, "Planning motions with intentions," in Proc. SIGGRAPH, 1994, pp. 395–408.

[23] S. T. Shivappa, M. M. Trivedi, and B. D. Rao, "Audio-visual information fusion in human computer interfaces and intelligent environments: A survey," Proc. IEEE, vol. 98, no. 10, pp. 1692–1715, Oct. 2010.

[24] L. Sigal and M. J. Black, "Guest editorial: State of the art in image and video based human pose and motion estimation," Int. J. Comput. Vis., vol. 87, no. 1, pp. 1–3, 2010.

[25] J. Soechting and M. Flanders, "Errors in pointing are due to approximations in sensorimotor transformations," J. Neurophysiol., vol. 62, no. 2, pp. 595–608, 1989.

[26] A. Sundaresan and R. Chellappa, "Model driven segmentation of articulating humans in Laplacian Eigenspace," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 10, pp. 1771–1785, Oct. 2008.

[27] C. Tran and M. M. Trivedi, "Driver assistance for keeping hands on the wheel and eyes on the road," in Proc. IEEE Int. Conf. Veh. Electron. Safety, 2009, pp. 97–101.

[28] C. Tran and M. M. Trivedi, "Introducing XMOB: Extremity movement observation framework for upper body pose tracking in 3D," in Proc. 11th IEEE Int. Symp. Multimedia, 2009, pp. 446–447.

[29] C. Tran and M. M. Trivedi, "Vision for driver assistance: Looking at people in a vehicle," in Guide to Visual Analysis of Humans: Looking at People, T. B. Moeslund, L. Sigal, V. Krueger, and A. Hilton, Eds. New York: Springer-Verlag, 2011.

[30] M. M. Trivedi and S. Y. Cheng, "Holistic sensing and active displays for intelligent driver support systems," IEEE Comput., vol. 40, no. 5, pp. 60–68, May 2007.

[31] M. M. Trivedi, T. L. Gandhi, and K. S. Huang, "Distributed interactive video arrays for event capture and enhanced situational awareness," IEEE Intell. Syst., vol. 20, no. 5, pp. 58–66, Sep./Oct. 2005.

[32] M. M. Trivedi, K. S. Huang, and I. Mikic, "Dynamic context capture and distributed video arrays for intelligent spaces," IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 35, no. 1, pp. 145–163, Jan. 2005.

[33] M. Vlachos, G. Kollios, and D. Gunopulos, "Discovering similar multidimensional trajectories," in Proc. IEEE 18th Int. Conf. Data Eng., 2002, pp. 673–684.

[34] J. Wang, P. Liu, M. She, and H. Liu, "Human action categorization using conditional random field," in Proc. IEEE Workshop Robot. Intell. Inform. Structured Space, 2011, pp. 131–135.

[35] E. Yu and J. K. Aggarwal, "Human action recognition with extremities as semantic posture representation," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2009, pp. 1–8.

[36] J. Ziegler, K. Nickel, and R. Stiefelhagen, "Tracking of the articulated upper body on multi-view stereo image sequences," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2006, pp. 774–781.

Cuong Tran (S'10) received the B.S. degree from the Hanoi University of Technology, Hanoi, Vietnam, and the M.S. degree from the University of California (UC), San Diego, both in computer science, in 2004 and 2008, respectively. He is currently working toward the Ph.D. degree at the Computer Vision and Robotics Research Laboratory, UC San Diego.

His research interests include vision-based human pose estimation and activity analysis for interactive applications, intelligent driver assistance, human–machine interfaces, and behavior prediction.

Mr. Tran is a Vietnam Education Foundation Fellow.


Mohan Manubhai Trivedi (F'09) received the Ph.D. degree in electrical engineering from Utah State University, Logan, in 1979.

He is a Professor of Electrical and Computer Engineering and the Founding Director of the Computer Vision and Robotics Research Laboratory and the Laboratory for Intelligent and Safe Automobiles, University of California, San Diego. His current research interests include machine and human perception, machine learning, human-centered multimodal interfaces, intelligent transportation, driver assistance, and active safety systems. He serves as a Consultant to industry and government agencies in the U.S. and abroad, including the National Academies, major auto manufacturers, and research initiatives in Asia and Europe.

Prof. Trivedi is a Fellow of the International Association of Pattern Recognition for contributions to vision systems for situational awareness and human-centered vehicle safety, and a Fellow of The International Society for Optical Engineers for distinguished contributions to the field of optical engineering.