
IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, VOL. 57, NO. 8, AUGUST 2008 1627

A Prototype for 3-D Hand Tracking and Posture Estimation

Ayman El-Sawah, Nicolas D. Georganas, and Emil M. Petriu, Fellow, IEEE

Abstract—In this paper, we describe our experience in designing a prototype for 3-D hand tracking and dynamic gesture recognition. Our objective is to continuously track the hand visually against a general background and to recognize dynamic gestures in real time. The conditions that justify our approach are a generic, unrestricted environment and a generic, nonspecific application. Our prototype has undergone several stages, from simulation through three different cycles of development and testing. We also present a road map for future development toward the final goal.

Index Terms—Computational geometry, error model analysis, hand modeling, multijoint finger inverse kinematics, sensor calibration, vision-based hand posture estimation.

I. INTRODUCTION

DYNAMIC gesture recognition applications require a high data rate of hand postures, traditionally provided by motion tracking gloves that accurately record finger joint motions via flex sensors in a tightly fitting glove. Although data gloves are very precise hand posture-estimation devices, they are too expensive for most commercial applications. We use a camera-based posture-estimation system, motivated by the low cost of camera peripherals and the increased processing power of modern computers, and we use the data glove to calibrate and validate our system.

Three-dimensional vision-based hand tracking is very challenging due to the high DOFs of the hand and fingers, the loss of depth information due to image projection, self-occlusion, and the limitations of image feature extraction techniques. In our quest for 3-D hand tracking, we use the minimum set of features required to estimate the 26-DOF hand posture; however, our approach can utilize more features as they become available. The detected features are used to generate a hand posture hypothesis, which is then validated using an observation model.

Although we are using markers to identify the gesturing hand and the fingertips, our system handles the noise produced from the environment and provides hand postures in the form of 26 DOF, which is similar to a CyberGlove and a 6-DOF position sensor. The overview of our system is shown in Fig. 1. First, the

Manuscript received July 5, 2007; revised May 4, 2008. This work was supported by the Learning Object Repositories Network–Natural Sciences and Engineering Research Council of Canada research network.

The authors are with the Discover Laboratory, University of Ottawa, Ottawa, ON K1N 6N5, Canada (e-mail: [email protected]; [email protected]; [email protected]).

Color versions of one or more figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIM.2008.925725

Fig. 1. Posture estimation system overview.

detected 2-D features are used to produce a 3-D posture hypothesis (2-D → 3-D). Then, the hypothesis is validated by projecting the 3-D hand model back to the image plane (3-D → 2-D) and calculating the probability of the hypothesis given other observed hand features using our observation model.

The following section describes the state of the art in vision-based hand tracking, followed by a description of the proposed approach.

II. RELATED WORK

Lee and Kunii [1] provided a comprehensive analysis of model-based hand tracking using a video camera and a color-coded glove. They tracked seven characteristic points on the hand, defined a set of constraints to simplify the finger inverse kinematics, and provided a fitting algorithm to fit the hand model to the hand image based on the hand constraints and features extracted. Stenger et al. [5], [6] used a 27-DOF truncated-conics-based hand model and an unscented Kalman filter to track hand postures in a 3-D trajectory. They extracted the hand model edges, handled occlusion using a simple ray-tracing technique, compared the projected model edges with the edges extracted from the video, and produced a difference vector that was fed to the Kalman gain matrix to estimate the new posture (state) of the model in an attempt to minimize the difference between expected and observed postures. Wu et al. [7] tracked pointing finger gestures using a single camera. They used skin color segmentation to extract the face and the arm of the operator and used the curvature peak to decide the position of the pointing finger. The shoulder location is derived from the location of the head, and the elbow position is derived at the maximum distance from the line joining the fingertip and the shoulder. They used spherical kinematics and orthographic projection to solve the 6 DOF for the shoulder


and elbow joints, and they derived the pointing direction in 3-D relative to the shoulder. Lu et al. [8], [9] used multiple optical cues in the form of edges, optical flow, and shading to derive 3-D forces that are applied to a 26-DOF hand model. They used a forward recursive dynamic model to track the motion in response to the derived 3-D forces. We consider their work complementary to our own, as we are both using multiple visual cues and a 26-DOF hand model, although their approach used dynamics and forward kinematics concepts, whereas ours uses inverse kinematics to track the hand. Buchmann et al. [10] used the ARToolkit tracking library to track three feature points on the hand in a gesture-enabled urban design augmented reality application. The thumb and index fingers were used to grasp building models and move them around the application. The three markers were used to render a 1-DOF model of the index and thumb fingers, enough to perform a two-finger grasp gesture on the virtual building in the scene. Lathuiliere and Herve [11] proposed a real-time visual hand tracking and posture estimation system for grasping gestures. They used a dark glove with colored fingertip markers, as well as three dot markers on the wrist. They used a 26-DOF hand model and forward and inverse kinematics to estimate the hand posture. Their approach is similar to ours, although they used simplified kinematics for the hand, for example, assuming that the finger's distal interphalangeal (DIP) and proximal interphalangeal (PIP) joints have the same angular values. We use the corners of a square marker, in contrast to three point markers, to detect the hand position and orientation in 3-D, which gives more accurate results by avoiding skin deformation. The quantification of the accuracy of the vision-based posture estimation approach has, to the best of our knowledge, not been explored before for marker or markerless systems.

III. PROPOSED APPROACH

Many researchers have proposed tracking of a markerless hand; however, they have not provided a measure of the accuracy of the estimated hand posture, and they usually require a constant and noiseless background for their images. We decided to use a marker glove to be able to compare our posture estimation with a CyberGlove and to allow our prototype to be used in an unconstrained, cluttered environment. The markers are used to generate a hand posture hypothesis, which is qualified using an objective function representing the probability of the posture hypothesis given the acquired image. This objective function will hereafter be called the observation model.

Our originally proposed approach [14] included a brief description of the 2-D-feature-to-3-D-posture transformation; in [16], we introduced the observation model that is used to qualify the derived 3-D hypothesis against ground truth obtained from the acquired video. In this paper, we explain the different stages that our prototype went through and provide a collective view of the results we obtained. The first stage is a graphical simulation of the hand tracking problem, in which we set our requirements for successful hand tracking and validate our tracking algorithm in simulation. The second stage is tracking with a real video

Fig. 2. Hand model visualization.

Fig. 3. System overview—Phase 1.

camera and validating the accuracy of our tracking system using the CyberGlove as a reference. In the third stage, we extend our system to multiple cameras.

In the following sections, we describe in more detail the 2-D feature extraction, the 26-DOF hand model, the 2-D-feature-to-3-D-posture transformation, and the observation model.

A. Prototype Phase 1—Simulation

In this phase, we took a simulation approach to investigate the solvability of the hand tracking problem using a single camera. A CyberGlove was used to drive our hand model, which is visualized in Open Scene Graph, as shown in Fig. 2. The 2-D projections of the model's fingertips and the corners of a hypothetical square marker at the model's root were used to estimate the 3-D hand posture.

The goal was to address the challenges facing 3-D vision-based hand tracking, namely, the loss of depth data due to projection and the processing speed. The proposed framework (see Fig. 3) is based on geometric computations and inverse kinematics and demonstrates the feasibility and the limitations of vision-based hand tracking using a single camera. It decouples pose calculation from pose acquisition by introducing the concept


Fig. 4. Three-dimensional hand model.

of pose prediction, which acts as a noise filter, an ambiguity resolver, and a posture extrapolator.

A first-in–first-out (FIFO) sequence of estimated postures is used to predict the hand posture at any point in time. Posture prediction can be performed internally by the tracker or externally by a gesture-recognition module. The estimated posture FIFO sequence is, thus, protected by a mutual exclusion construct.
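As a concrete illustration of this design (not code from the paper), the following Python sketch shows one way such a mutex-protected posture FIFO could be organized; the class name, buffer length, and method names are our own assumptions.

```python
import threading
from collections import deque

class PostureFIFO:
    """Fixed-length FIFO of (timestamp, 26-DOF posture) pairs shared by the
    tracker (producer) and the prediction/gesture-recognition modules (consumers)."""

    def __init__(self, maxlen=30):
        self._lock = threading.Lock()        # mutual exclusion construct
        self._buffer = deque(maxlen=maxlen)  # oldest estimates drop off automatically

    def push(self, timestamp, posture):
        """Called by the tracker each time a new posture estimate is accepted."""
        with self._lock:
            self._buffer.append((timestamp, posture))

    def snapshot(self):
        """Return a copy of the recent estimates for extrapolation/prediction."""
        with self._lock:
            return list(self._buffer)
```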

A set of minimal requirements for real-time 3-D hand tracking using a single camera is proposed, and feasibility is demonstrated using graphics. To maintain hand tracking, we require information about hand position and orientation, which is provided by a palm marker, and information about hand posture, which is provided by markers positioned at the fingertips. The simulation utilized the 26-DOF hand model to determine the 3-D position of the fingertips derived from the CyberGlove data and to calculate the finger joints from the end-effector position using inverse kinematics.

1) Hand Model: The 26-DOF hand model, shown in Fig. 4, provides forward and inverse kinematics transformation utilities, which are extensively used by the 2-D/3-D feature-to-posture transformation and the 3-D/2-D posture hypothesis projection on the image plane.

The hand model consists of five kinematic chains modeling the fingers, which are connected to the root. All fingers, except for the thumb, consist of four links; the first link is fixed to the root and is part of the palm, and the second link is connected to the chain by two 1-DOF revolute joints connected by an abstract zero-length link for flexion and abduction movements. The other two links are connected to the chain through a 1-DOF revolute joint providing flexion movement. The thumb

Fig. 5. Two-dimensional/three-dimensional posture conversion.

has only three links; the first link is connected to the root by two 1-DOF revolute joints, which are connected to each other by an abstract zero-length link to model the 2-DOF carpometacarpal joint. The other two links are connected to the chain through a 1-DOF revolute joint.

2) 3-D/2-D Feature-to-Posture Transformation: In this section, we explain how the 3-D model data are projected on the image plane. As shown in Fig. 4, the hand is modeled as an articulated figure, which is a structure that consists of a series of rigid links connected at joints [18]. Mathematically, each link is represented as a 3-D translation vector, and each joint is represented by a 3 × 3 rotation matrix; both can be combined into a 4 × 4 homogeneous transformation matrix. Thus, by knowing the position and the orientation of the model's root in 3-D, we can evaluate the 3-D position of all points of the hand model, including the fingertips, using forward kinematics. A transformation between the model/world coordinate system and the eye/camera coordinate system can also be represented by a 4 × 4 transformation matrix called the model view matrix; the calculation of this matrix is explained in Section III-A3. The model can then be projected (orthographically or perspectively) on the image plane using a well-defined projection matrix [18].
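The following is a minimal Python sketch of this forward-kinematics-plus-projection step, assuming flexion joints rotate about the local z-axis and links extend along the local x-axis; the function names and parameterization are illustrative, not the paper's implementation.

```python
import numpy as np

def revolute(theta, axis):
    """4x4 homogeneous transform for a 1-DOF revolute joint about a unit axis."""
    x, y, z = axis
    c, s = np.cos(theta), np.sin(theta)
    C = 1.0 - c
    T = np.eye(4)
    T[:3, :3] = [[c + x*x*C,   x*y*C - z*s, x*z*C + y*s],
                 [y*x*C + z*s, c + y*y*C,   y*z*C - x*s],
                 [z*x*C - y*s, z*y*C + x*s, c + z*z*C]]
    return T

def link(length):
    """4x4 homogeneous transform for a rigid link along the local x-axis."""
    T = np.eye(4)
    T[0, 3] = length
    return T

def fingertip_position(root_T, joint_angles, link_lengths):
    """Forward kinematics: compose root pose, joints, and links to obtain the
    fingertip position in world coordinates (flexion about the local z-axis)."""
    T = root_T.copy()
    for theta, L in zip(joint_angles, link_lengths):
        T = T @ revolute(theta, (0.0, 0.0, 1.0)) @ link(L)
    return T[:3, 3]

def project_pinhole(point_cam, f):
    """Perspective projection of a 3-D point in camera coordinates onto the image plane."""
    X, Y, Z = point_cam
    return np.array([f * X / Z, f * Y / Z])
```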

3) 2-D/3-D Feature-to-Posture Transformation: The detected 2-D marker features are used to generate a hand-posture hypothesis. As shown in Fig. 5, we are using the concepts of perspective geometry and inverse kinematics. First, the palm marker corners are used to calculate the hand position and orientation with respect to the camera. We use a pinhole camera model and perspective geometry in our derivation. The transformation is formally defined by the camera-to-hand-model coordinate transformation (T), which is expressed in homogeneous coordinates as a 4 × 4 matrix that includes an orthonormal 3 × 3 rotation matrix and a translation vector. The model view transformation is calculated as the product of the camera-to-palm-marker transformation T1, which is calculated on a frame-by-frame basis, and the palm-marker-to-hand-model transformation T2, which is considered constant and is calculated during system calibration.


Fig. 6. Hand orientation calculation.

Knowing the palm marker dimensions and the camera focal distance, it is possible to determine the camera-to-marker transformation T1 from the location of the marker corners on the image plane [3].

Hand position and orientation: The geometry of the orientation problem is illustrated in Fig. 6.

We use the operator ⊗ to denote the vector cross product and the operator • to denote the vector dot product. The unit vectors mA, mB, and mC, which are called N-vectors [3], represent the directions of the marker corners A, B, and C, respectively, as seen from the camera viewpoint, and are given by

$$m_A = \frac{(x_A, y_A, f)}{\|(x_A, y_A, f)\|},\qquad m_B = \frac{(x_B, y_B, f)}{\|(x_B, y_B, f)\|},\qquad m_C = \frac{(x_C, y_C, f)}{\|(x_C, y_C, f)\|} \tag{1}$$

where f is the camera focal length, and (x, y) denotes the pixel coordinates of the point with respect to the camera principal point. The positions of the corners A, B, and C in 3-D are given by

$$\vec{OA} = m_A r_A,\qquad \vec{OB} = m_B r_B,\qquad \vec{OC} = m_C r_C. \tag{2}$$

The plane normal N is given by the cross product AB ⊗ BC, i.e.,

$$N = \vec{AB} \otimes \vec{BC} = (r_B m_B - r_A m_A) \otimes (r_C m_C - r_B m_B). \tag{3}$$

The orientation problem is solved by determining the depths rA, rB, and rC using the geometrical constraints of the marker, which are given as follows:

$$\begin{aligned}
\|AB\|^2 &= r_A^2 + r_B^2 - 2 r_A r_B (m_A \bullet m_B)\\
\|BC\|^2 &= r_B^2 + r_C^2 - 2 r_B r_C (m_B \bullet m_C)\\
\|CA\|^2 &= r_C^2 + r_A^2 - 2 r_C r_A (m_C \bullet m_A)
\end{aligned} \tag{4}$$

$$\begin{aligned}
\cos(\beta) &= r_B r_C (m_B \bullet m_C) + r_A r_B (m_A \bullet m_B) - r_A r_C (m_A \bullet m_C) - r_B^2\\
\cos(\alpha) &= r_A r_B (m_A \bullet m_B) + r_C r_A (m_C \bullet m_A) - r_C r_B (m_C \bullet m_B) - r_A^2\\
\cos(\gamma) &= r_C r_A (m_C \bullet m_A) + r_B r_C (m_B \bullet m_C) - r_B r_A (m_B \bullet m_A) - r_C^2.
\end{aligned} \tag{5}$$

The geometry is simplified using a square marker. In this case, AB = BC = L and β = 90°, i.e.,

$$L^2 = r_A^2 + r_B^2 - 2 r_A r_B (m_A \bullet m_B) = r_B^2 + r_C^2 - 2 r_B r_C (m_B \bullet m_C) \tag{6}$$

$$0 = r_B r_C (m_B \bullet m_C) + r_A r_B (m_A \bullet m_B) - r_A r_C (m_A \bullet m_C) - r_B^2. \tag{7}$$

Two quadratic equations in rA/rB and rC/rB can be obtained, and two possible solutions for rA, rB, and rC are obtained. We can assume, without loss of generality, that AB is the x-axis, BC is the y-axis, and B is the origin of the feature coordinate system. The feature-to-camera transformation T1 is given by

$$T_1^{-1} = \begin{bmatrix} X & Y & Z & \vec{OO'} \\ 0 & 0 & 0 & 1 \end{bmatrix},\qquad \vec{OO'} = r_B m_B \tag{8}$$

$$X = \frac{r_B m_B - r_A m_A}{\|r_B m_B - r_A m_A\|},\qquad Y = \frac{r_C m_C - r_B m_B}{\|r_C m_C - r_B m_B\|},\qquad Z = X \otimes Y$$

$$T_1 = \begin{bmatrix} R^T & -R^T \vec{OO'} \\ 0 & 1 \end{bmatrix},\qquad R = [\,X\;\;Y\;\;Z\,],\qquad R R^T = R^T R = I. \tag{9}$$

The marker-to-hand-model root transformation T2 is considered to be the identity matrix during the simulation and is determined during the system calibration in the second phase.
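Below is a hedged Python sketch of (1), (8), and (9): it builds the N-vectors and assembles T1, assuming the depths rA, rB, and rC have already been obtained from the square-marker constraints (6) and (7); the function names are ours.

```python
import numpy as np

def n_vector(px, py, f):
    """Unit N-vector of an image point (px, py) relative to the principal point, per (1)."""
    v = np.array([px, py, f], dtype=float)
    return v / np.linalg.norm(v)

def camera_to_marker_transform(mA, mB, mC, rA, rB, rC):
    """Build T1 from the corner N-vectors and their depths, per (8) and (9).
    The depths are assumed to come from solving the square-marker constraints (6)-(7)."""
    A, B, C = rA * mA, rB * mB, rC * mC   # corner positions in camera coordinates, per (2)
    X = (B - A) / np.linalg.norm(B - A)   # marker x-axis
    Y = (C - B) / np.linalg.norm(C - B)   # marker y-axis
    Z = np.cross(X, Y)                    # marker plane normal
    R = np.column_stack((X, Y, Z))        # rotation with columns X, Y, Z, per (9)
    OO = B                                # marker origin (corner B) in camera coordinates
    T1 = np.eye(4)
    T1[:3, :3] = R.T
    T1[:3, 3] = -R.T @ OO
    return T1
```

The full model view matrix would then be obtained as the product T1 · T2, with T2 coming from the calibration described above.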

Finger posture: The detected finger markers are used to determine a linear segment that is reachable by the finger along the camera view direction through the 2-D feature on the image plane, as shown in Fig. 7.

The camera's intrinsic parameters, namely, the focal length and the principal point, are used to define the locus of the fingertip from the projected marker position. The finger length determines the start and the end of the linear segment. Features that are not reachable by the finger are rejected as noise. The reachable linear segment is sampled at constant lengths to calculate a finger posture hypothesis at each sample point using an inverse kinematics transformation. The inverse kinematics of the thumb is performed by a binary search of a lookup table of all feasible end-effector positions using a 0.1-rad granularity of the 4-DOF thumb, sorted by the distance of the end-effector to the thumb's base. The inverse kinematics of the other fingers is solved using an error model analysis technique [15]. The technique


Fig. 7. Detecting 3-D end-effector range.

addresses an error in the finger-abduction calculation that occurs due to the isolation of the abduction DOF from the flexion DOFs. It utilizes the finger's natural constraint to calculate the 4-DOF finger in two steps. First, the PIP and DIP values are set to zero, and an estimate is simultaneously calculated for the abduction and metacarpophalangeal (MCP) values. Then, the error is corrected using a predefined error model utilizing the finger's natural constraint. Last, the corrected abduction and MCP values are used to calculate the PIP and DIP values.
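The thumb lookup-table inverse kinematics described above might be sketched as follows in Python; the forward-kinematics callback, joint ranges, and tolerance are placeholders of our own, not values from the paper.

```python
import bisect
import itertools
import numpy as np

def build_thumb_table(fk, joint_ranges, step=0.1):
    """Enumerate the 4-DOF thumb at `step`-rad granularity, record each end-effector
    position, and sort entries by distance to the thumb base. `fk(angles)` is assumed
    to return the end-effector position with the thumb base at the origin."""
    table = []
    grids = [np.arange(lo, hi + 1e-9, step) for lo, hi in joint_ranges]
    for angles in itertools.product(*grids):
        tip = np.asarray(fk(angles))
        table.append((np.linalg.norm(tip), tip, angles))
    table.sort(key=lambda entry: entry[0])
    return table

def thumb_ik(table, target, tolerance=0.01):
    """Binary-search the table for entries whose base distance matches the target,
    then return the joint angles of the candidate closest to the target position."""
    d = np.linalg.norm(target)
    keys = [entry[0] for entry in table]  # could be precomputed once alongside the table
    lo = bisect.bisect_left(keys, d - tolerance)
    hi = bisect.bisect_right(keys, d + tolerance)
    candidates = table[lo:hi]
    if not candidates:
        return None
    best = min(candidates, key=lambda entry: np.linalg.norm(entry[1] - target))
    return best[2]
```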

4) Handling Ambiguity: The 2-D/3-D feature-to-posture transformation provides a set of posture hypotheses that have yet to be optimized. The posture observation module provides an objective function for posture estimation. The posture prediction is based on two factors. The first factor is the canonical probability of the posture hypothesis, which is obtained by sampling the CyberGlove data over a long period of time while trying as many dynamic gestures as possible. The second factor is the closeness of the posture to our prediction. Posture prediction is performed by extrapolating previous posture estimates using a combination of extrapolation functions of different orders. The lowest (zero) order function is a constant that is equal to the last estimated posture, the next function is a linear interpolation between the last two postures, and so on. To avoid prediction instability, we systematically drop higher order extrapolating functions as the extrapolation period becomes longer. This is mathematically done by adding an exponential damping factor to the prediction combination. The following presents the damped extrapolation polynomial of the ith degree:

$$\wp_i(t) = e^{-\frac{t - t_0}{T_i}} P_i(t) + \left(1 - e^{-\frac{t - t_0}{T_i}}\right) \wp_{i-1}(t),\qquad \wp_0(t) = \nu_0 \tag{10}$$

where Pi(t) is the extrapolating polynomial of degree i, ν0 is the latest estimated value at time t0, and Ti is a damping factor with Ti < Ti+1.
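A minimal sketch of the damped extrapolation in (10) for a single DOF is given below, assuming the per-order extrapolating polynomials Pi(t) are least-squares fits to the most recent estimates and using arbitrary damping constants; neither assumption comes from the paper.

```python
import numpy as np

def damped_prediction(times, values, t, dampings=(0.1, 0.2, 0.4)):
    """Damped extrapolation per (10) for one DOF.
    `times`/`values` hold the most recent posture estimates (oldest first);
    `dampings` are T_1 < T_2 < T_3; P_i is a degree-i least-squares polynomial fit."""
    t0 = times[-1]
    pred = values[-1]                       # zero-order term: the last estimated value
    for i, Ti in enumerate(dampings, start=1):
        if len(times) < i + 1:
            break                           # not enough history for a degree-i polynomial
        coeffs = np.polyfit(times[-(i + 1):], values[-(i + 1):], deg=i)
        Pi = np.polyval(coeffs, t)
        w = np.exp(-(t - t0) / Ti)          # exponential damping factor
        pred = w * Pi + (1.0 - w) * pred    # recursive combination of (10)
    return pred
```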

The system starts from a posture where all fingers are straight and all DOFs are zero, and it proceeds with tracking by estimating postures.

5) Phase 1 Results: In this experiment, the posture acquired from the CyberGlove was used to drive the hand model; the projections of the hand marker and fingertips were used by the tracking system to estimate the hand posture. The camera viewpoint was changed by panning, zooming, and moving the

Fig. 8. Statistical error of the estimated DOF.

model in the graphics environment. The results of the tracking simulation are shown in Fig. 8, where the errors are in radians.

The histograms of the posture estimation error, in comparison to the CyberGlove measurements, are shown. The error is normalized by the DOF dynamic range to facilitate comparison among the different DOFs. We observed that the system estimates the MCP joint angle best and estimates abduction worst. The reason is that the MCP joint generally has the maximum impact on the end-effector position; by the same token, abduction can only be reliably estimated when the fingers are not curved, where its impact is maximal. We derived two measures of the finger's curvature, given as follows:

$$C_i = 2\,\mathrm{Inv}\left(\mathrm{Sinc}(d_i / L_i)\right) \tag{11}$$

where Inv(Sinc(x)) is the inverse of the Sinc function, which is defined as sin(x)/x, d is the distance between the two ends of the finger segment, and L is the length of the finger segment. The curvature measure is normalized and lies in the range [0, 2π], with a practical range of [0, 4] rad. The major curvature includes the finger and the palm, and the minor curvature includes only the finger. The curvature estimates have a lower error than the MCP estimate, as shown in Fig. 9, where the DOF error is normalized by its dynamic range and is therefore unitless.
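The curvature measure in (11) requires inverting the Sinc function numerically; a small Python sketch using bisection (our choice of method, not the paper's) is given below.

```python
import numpy as np

def inv_sinc(y):
    """Numerically invert Sinc(x) = sin(x)/x on [0, pi], where it decreases
    monotonically from 1 to 0; returns x such that Sinc(x) = y."""
    lo, hi = 0.0, np.pi
    for _ in range(60):                 # bisection to a very small interval width
        mid = 0.5 * (lo + hi)
        if np.sinc(mid / np.pi) > y:    # np.sinc(x) = sin(pi*x)/(pi*x), so this is sin(mid)/mid
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def curvature(d, L):
    """Curvature measure per (11): C = 2 * Inv(Sinc(d / L)), where d is the end-to-end
    distance and L the length of the finger segment; the ratio is clipped against noise."""
    ratio = min(max(d / L, 0.0), 1.0)
    return 2.0 * inv_sinc(ratio)
```

For a straight segment (d = L), the function returns 0; as the two ends approach each other, it approaches 2π, matching the stated range of the measure.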

The main cause of error in the simulation is the loss of depth information due to projection and the lack of visual cues. This caused intermittent visual illusions; however, the system quickly recovers once a better viewpoint is provided.

B. Prototype Phase 2—Facing Reality

The simulation provides a proof of concept of the system but suffers from some limitations. First, the position of the marker


Fig. 9. Statistical error of curvature.

features, calculated from the projection of 3-D features in the graphic scene, is very accurate and assumes a pinhole camera model with well-defined camera intrinsic parameters; this differs from the real situation, where the camera parameters are calculated by a calibration procedure and suffer from approximation errors. Second, the simulation does not include background clutter. Third, the simulation uses the same hand model in forward and inverse kinematics and does not test the approximation error of the hand model with respect to a real hand. Fourth, the simulation does not test for parameters such as skew and lens distortion. Last, the simulation did not require calculating a second transformation, from the palm marker to the model's root, as the marker was centered on the root. Ideally, a 3-D scanning device would be required to calculate this transformation; however, we had to create a calibration procedure for that purpose using a second marker positioned at the root. The choice of the marker location is made such that it provides the best visibility to the camera.

In this phase, we reused the hand model, including the forward and inverse kinematics techniques, and the 2-D/3-D feature-to-posture transformation. However, we had to detect the 2-D features from the acquired video frames, and we replaced the posture prediction module with an observation model utilizing the segmented hand color and silhouette, as shown in Fig. 1.

1) 2-D Feature Extraction: We use a dark glove with color-coded ring markers to detect the fingertips and 2-cm square markers to derive the palm's position and orientation with respect to the camera (Fig. 10). We use two palm markers—one for the front and the other for the back of the palm; however, more markers can be added as required. The palm markers are identical and can be distinguished using a color-coded tag. The marker corners are precisely extracted by segmenting the marker contour at the four corners, then using least squares linear fitting to calculate the four sides of the square, and, finally, calculating the intersections to determine the corner locations with subpixel accuracy. The corners are used to calculate the hand position and orientation in 3-D space using perspective geometry [3]. Five colored ring markers are used to mark the fingertips.
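A sketch of the least-squares line fitting and intersection step is given below, assuming the marker contour has already been split into its four side segments (that segmentation is not shown); the helper names are ours.

```python
import numpy as np

def fit_line(points):
    """Total-least-squares line fit: returns (centroid, unit direction) of a 2-D point set."""
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    # Principal direction of the centered points gives the line direction.
    _, _, vt = np.linalg.svd(pts - centroid)
    return centroid, vt[0]

def intersect(line1, line2):
    """Intersection of two lines given as (point, direction); assumes they are not parallel."""
    p1, d1 = line1
    p2, d2 = line2
    # Solve p1 + t*d1 = p2 + s*d2 for t.
    A = np.column_stack((d1, -d2))
    t, _ = np.linalg.solve(A, p2 - p1)
    return p1 + t * d1

def marker_corners(side_segments):
    """Subpixel corners of a square marker from its four contour side segments
    (ordered around the square): intersect each fitted side with the next one."""
    lines = [fit_line(seg) for seg in side_segments]
    return [intersect(lines[i], lines[(i + 1) % 4]) for i in range(4)]
```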

The locations of the finger markers are determined as holes in the segmented hand. The holes are derived by a morphological closing operation. First, the holes are closed using a dilate operation; then, the hand is restored using an erode operation. To avoid using a large kernel size, dilate and erode operations are

Fig. 10. Projecting the hand model and finger edges.

Fig. 11. Error model of pinky MCP.

iteratively performed using a 3 × 3 kernel, where the number of iterations is set in proportion to the hand marker size. The holes are detected by masking the original hand segment from the closed hand segment. The markers are identified by their colors, where we use the hue–saturation–value color space, as it effectively captures the color hue. The 3-D color histogram is segmented using multiple thresholds to determine overlapping color ranges for each marker. The number of pixels in each range, normalized by the palm marker area, is used as an attribute vector for the detected color blobs. A classifier is trained and used to classify the detected fingertip features using their color attributes.
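A possible OpenCV sketch of the hole-detection step is shown below; the proportionality between the iteration count and the palm marker size is our own assumption.

```python
import cv2
import numpy as np

def fingertip_marker_mask(hand_mask, palm_marker_side_px):
    """Find ring-marker 'holes' in the segmented hand, per the procedure above:
    close the holes with iterated 3x3 dilation, restore the outline with iterated
    3x3 erosion, then mask out the original hand segment."""
    kernel = np.ones((3, 3), np.uint8)
    # Iteration count in proportion to the palm marker size (the factor is our guess).
    iterations = max(1, palm_marker_side_px // 4)
    closed = cv2.dilate(hand_mask, kernel, iterations=iterations)
    closed = cv2.erode(closed, kernel, iterations=iterations)
    holes = cv2.bitwise_and(closed, cv2.bitwise_not(hand_mask))
    return holes
```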

2) Posture Hypothesis Validation: In the presence of 2-D feature noise, and due to the underconstrained nature of the finger inverse kinematics, particularly the loss of depth due to camera projection, a qualifying process must be used to produce the most probable posture. We propose a probabilistic observation model utilizing the segmented hand image and its silhouette. The hand posture hypothesis is applied to the hand model using forward kinematics, and then the resulting 3-D model is projected on the image plane, as in Fig. 10. The projected model is sampled at N points.

At each point, we calculate the probability of the edge being a correct observation given the color of the pixels in its neighborhood using the glove color histogram. Moreover, the corresponding finger edge (see Fig. 10) is qualified using an


Fig. 12. Two-camera extension (type 1).

Fig. 13. Two-camera extension (type 2).

observation function, adopted from [12], by assuming that the clutter is a Poisson process along the direction perpendicular to the edge with spatial density λ, that all edges have the same probability of being the true observation, and that the projected edge accuracy is a normal distribution with standard deviation σ. The resulting probability at a sample edge point is given as follows, where q is the probability that the true edge is not detected, and d is the distance between the edge hypothesis and the detected edge:

$$p(\mathrm{hyp} \mid \mathrm{obs}) \propto 1 + \frac{1}{\sqrt{2\pi}\,\sigma q \lambda} \sum e^{-\frac{d^2}{2\sigma^2}}. \tag{12}$$

The summation in (12) is performed over all the detected edges along the direction perpendicular to the model edge. Since each term decreases exponentially with distance, only a small distance range is considered, beyond which the contribution to the probability is insignificant. The 2-D observation model over the N sample points is given by the product of the point probabilities, i.e.,

$$P = \prod_{i=1}^{N} p_i(\mathrm{hyp} \mid \mathrm{edge})\; p_i(\mathrm{hyp} \mid \mathrm{color}). \tag{13}$$

As we may detect multiple possible targets for the end-effector of a finger, and we do not know the true target feature, we must include all detected targets in one function. We use a function similar to (12) to determine the probability of the end-effector hypothesis given all the detected 2-D end-effector features. In this case, λ is the density of the 2-D feature clutter, σ is the standard deviation of the detected feature accuracy, q is the probability that the true feature is not detected or is occluded, and d is the 2-D distance between the projected end-effector position and the detected feature position; the summation is performed over all detected features. The probability of

the posture hypothesis given the previously estimated postures is calculated as a sum of normal distributions with means at the nth-order polynomial extrapolations (n = 0, 1, 2, and 3) and standard deviations proportional to the extrapolation period, i.e., the time period between the first posture used in the polynomial extrapolation and the current time. Last, the overall observation model is given by the product of the aforementioned factors, namely, the segmented hand, the hand silhouette, the detected end-effector features, and the previous posture estimates, i.e., P = P1P2P3P4. The posture that gives the optimal probability according to the observation model is provided as the best estimate.
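A minimal sketch of the per-sample edge term (12) and the product (13) is given below; the parameter values are placeholders, and in practice one would likely accumulate log-probabilities to avoid numerical underflow.

```python
import numpy as np

def edge_point_probability(distances, sigma=2.0, lam=0.05, q=0.1):
    """Per-sample edge term of (12): `distances` holds the distances from the model-edge
    sample point to the detected edges along the perpendicular search line."""
    d = np.asarray(distances, dtype=float)
    return 1.0 + np.sum(np.exp(-d**2 / (2.0 * sigma**2))) / (np.sqrt(2.0 * np.pi) * sigma * q * lam)

def hypothesis_probability(edge_terms, color_terms):
    """2-D observation model (13): product of edge and color probabilities over the N samples."""
    return float(np.prod(np.multiply(edge_terms, color_terms)))
```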

3) Phase 2 Results: To measure the accuracy of the vision-based posture estimation, we gathered several video clips of hand postures at different hand orientations while wearing the CyberGlove. Although the CyberGlove does not qualify as a true reference, because it undergoes a calibration phase of its own, it provides a measure of confidence in the vision system output. To avoid synchronization problems between the image acquisition and the CyberGlove data acquisition, we restricted our tests to static postures and varied only the orientation. The aggregation of all tests provided us with a reasonable estimate of the hand tracking accuracy. The joint angles from both the CyberGlove and the model-based posture estimation are statistically analyzed. The histogram generated from every video clip is used to calculate the median and the confidence intervals for each DOF, which are then interpolated to derive the confidence interval values over the whole DOF dynamic range. Fig. 11 shows an example of the DOF error models acquired in this stage. More details can be found in [16].
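A sketch of the per-clip error statistics might look as follows; the percentile levels used for the confidence interval are our own choice, not values stated in the paper.

```python
import numpy as np

def dof_error_model(reference_angles, estimated_angles, lower=5, upper=95):
    """Per-DOF error statistics for one video clip: median error and a confidence interval
    (here the 5th/95th percentiles) of the estimate minus the CyberGlove reference."""
    errors = np.asarray(estimated_angles) - np.asarray(reference_angles)  # shape: (frames, DOFs)
    return (np.median(errors, axis=0),
            np.percentile(errors, lower, axis=0),
            np.percentile(errors, upper, axis=0))
```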

In this paper, we provide a different view of the results by showing the error distribution of the different DOFs, irrespective of the corresponding values, as shown in Fig. 14. In other


words, we focus on the repeatability of the hand posture estimation under variation of the clutter and of the hand orientation with respect to the camera viewpoint.

C. Prototype Phase 3—Multiple Cams

In this phase, we extend the observation model to include multiple cameras. Camera sensor fusion is done on different levels. At the first level, we do not assume knowledge of the geometrical transformations between the camera coordinate systems, which is useful when the cameras are moving. In this case, the posture hypothesis that is generated using the 2-D features from each camera is validated using the observation models of all the cameras. The overview of the two-camera system is shown in Fig. 12.

At the second level of camera sensor fusion, we use the geometrical transformation between the camera coordinate frames. In this case, the detected hand position and orientation are validated, and the best orientation is used by both models. The overview of the system is shown in Fig. 13.
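A minimal sketch of the first (transformation-free) fusion level is shown below, with placeholder hypothesis lists and observation-model callables of our own.

```python
def fuse_two_cameras(hypotheses_cam1, hypotheses_cam2, observe_cam1, observe_cam2):
    """Type-1 fusion: pool the posture hypotheses generated from each camera's 2-D
    features and keep the one that maximizes the product of both observation models."""
    pooled = list(hypotheses_cam1) + list(hypotheses_cam2)
    return max(pooled, key=lambda hyp: observe_cam1(hyp) * observe_cam2(hyp))
```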

The error analyses of all phases of the system are collectively shown in Figs. 14 and 15 for the fingers and the thumb, respectively, where the errors are in radians.

All the figures show the collective error from the simulation, a single camera, and two cameras (types 1 and 2). In general, the simulation results provide a more accurate posture estimation due to the accurate detection of the 2-D features and the absence of occlusion, hand model approximation, and lens distortion artifacts. In some cases, however, e.g., the thumb minor curvature, the vision-based tracking provided better results than the simulation; the reason is the enhanced observation model used in the vision-based tracking, which utilizes the hand color and silhouette. The mean error, such as in the thumb abduction estimation, can be rectified by the system calibration, as discussed in [16]. In general, the finger DOF estimation is more accurate than that of the thumb due to the natural constraint of the finger motion. We recommend using a second marker for the thumb, positioned close to the base, for a better DOF estimation.

IV. CONCLUSION AND FUTURE WORK

We presented a framework for vision-based posture estimation using a single camera, which can be extended to multiple cameras. The framework is a two-step process, namely, posture hypothesis generation and validation. We validated the framework by comparing its outcome to the CyberGlove and showed that it provides reasonable results. The results are systematically enhanced by using multiple cameras at different levels of sensor fusion. We demonstrate the ability to use the system for dynamic gesture recognition in [17]. The results obtained from the simulation provide an upper bound on the system accuracy, although it can be improved by using the observation model. Using multiple cameras slightly enhances the accuracy but mainly improves the continuity of the data by increasing the working area. The system handles intermittent occlusion but does not handle occlusion over extended periods of time. This must be handled, in our opinion, by understanding the context

Fig. 14. Error comparison—Fingers.


Fig. 15. Error comparison—Thumb.

in the form of objects (grasped or manipulated) or conversations (hinting at particular gestures). Another challenge is 3-D tracking of the markerless hand. In this case, edge detection can be used to detect the fingertips, and the hand orientation may be estimated from the palm's silhouette. In all these cases, the concept of hypothesis generation and validation using the observation model can be reused.

ACKNOWLEDGMENT

The authors would like to thank C. Joslin for his role in initiating this research.

REFERENCES

[1] J. Lee and T. L. Kunii, "Model-based analysis of hand posture," IEEE Comput. Graph. Appl., vol. 15, no. 5, pp. 77–86, Sep. 1995.

[2] G. Burdea and P. Coiffet, Virtual Reality Technology. New York: Wiley-Interscience, 2003.

[3] K. Kanatani, Geometric Computation for Machine Vision, ser. Oxford Engineering Science Series #37. Oxford, U.K.: Oxford Science, 1993.

[4] I. Albrecht, J. Haber, and H. P. Seidel, "Construction and animation of anatomically based human hand models," in Proc. Eurographics/SIGGRAPH Symp. Comput. Animation, 2003, pp. 98–109.

[5] B. Stenger, P. R. S. Mendonca, and R. Cipolla, "Model-based hand tracking using an unscented Kalman filter," in Proc. BMVC, 2001, pp. 63–72.

[6] B. Stenger, P. R. S. Mendonca, and R. Cipolla, "Model-based 3D tracking of an articulated hand," in Proc. IEEE Conf. CVPR, 2001, vol. 2, pp. 310–315.

[7] A. Wu, M. Shah, and N. Da Vitoria Lobo, "Virtual 3D blackboard: 3D finger tracking using a single camera," in Proc. 4th IEEE Int. Conf. Autom. Face Gesture Recog., 2000, pp. 536–543.

[8] S. Lu, D. Metaxas, D. Samaras, and J. Oliensis, "Using multiple cues for hand tracking and model refinement," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2003, vol. 2, pp. 443–450.

[9] S. Lu, G. Huang, D. Samaras, and D. Metaxas, "Model-based integration of visual cues for hand tracking," in Proc. IEEE Workshop Motion Video Comput., 2002, pp. 118–124.

[10] V. Buchmann, S. Violich, M. Billinghurst, and A. Cockburn, "FingARtips: Gesture based direct manipulation in augmented reality," in Proc. 2nd Int. Conf. Comput. Graph. Interactive Tech., 2004, pp. 212–221.

[11] F. Lathuiliere and J.-Y. Herve, "Visual tracking of hand posture with occlusion handling," in Proc. 15th IEEE Int. Conf. Pattern Recog., 2000, vol. 3, pp. 1129–1133.

[12] A. Blake and M. Isard, Active Contours: The Application of Techniques From Graphics, Vision, Control Theory, and Statistics to Visual Tracking of Shapes in Motion. New York: Springer-Verlag, 2000.

[13] CyberGlove from Immersion. [Online]. Available: www.immersion.com

[14] C. Joslin, A. El-Sawah, D. Chen, and N. D. Georganas, "Dynamic gesture recognition," in Proc. IEEE IMTC, 2005, pp. 1706–1711.

[15] A. El-Sawah, N. D. Georganas, and E. M. Petriu, "Finger inverse kinematics using error model analysis for gesture enabled navigation in virtual environments," in Proc. IEEE HAVE Workshop, 2006, pp. 34–39.

[16] A. El-Sawah, N. D. Georganas, and E. M. Petriu, "Calibration and error model analysis of 3D monocular vision model based hand posture estimation," in Proc. IEEE IMTC, 2007, pp. 1–6.

[17] A. El-Sawah, C. Joslin, N. D. Georganas, and E. M. Petriu, "A framework for 3D hand tracking and gesture recognition using elements of genetic programming," in Proc. Int. Workshop VideoRec—4th Can. Conf. Comput. Robot Vis., 2007, pp. 495–502.

[18] A. Watt and M. Watt, Advanced Animation and Rendering Techniques. New York: ACM, 1992.

Ayman El-Sawah received the B.Sc. degree in electronic and communications engineering from Cairo University, Giza, Egypt, in 1985, the M.S. degree in systems design engineering from the University of Waterloo, Waterloo, ON, Canada, in 1995, and the Ph.D. degree in computer science from the University of Ottawa, Ottawa, ON, in 2008.

His research interests include image processing, computer vision, digital signal processing, forward error correction, multimedia, hand tracking, and gesture recognition.


Nicolas D. Georganas received the Dipl.Ing. degree in electrical engineering from the National Technical University of Athens, Athens, Greece, in 1966 and the Ph.D. degree in electrical engineering (summa cum laude) from the University of Ottawa, Ottawa, ON, Canada, in 1970.

He is the Distinguished University Professor and Associate Vice President, Research (External), with the University of Ottawa. His research interests include multimedia communications, ambient computing, and collaborative haptovirtual environments.

Dr. Georganas is a Fellow of the Canadian Academy of Engineering, of the Academy of Science (Royal Society of Canada), and of the Engineering Institute of Canada. He is a Laureate of the 2002 Killam Prize for Engineering, which is Canada's highest prize for career achievements in research, and he became an Officer of the Order of Canada in 2007, which is the highest honor for a Canadian. He was the recipient of Honorary Doctorate degrees from the Technical University of Darmstadt, Darmstadt, Germany, in 2004 and from the National Technical University of Athens in 2007.

Emil M. Petriu (M'86–SM'88–F'01) received the Dipl.Eng. and Dr.Eng. degrees from the Polytechnic Institute of Timisoara, Timisoara, Romania, in 1969 and 1978, respectively.

Since 1985, he has been a member of the faculty of the University of Ottawa, Ottawa, ON, Canada, where he is currently a Professor and the University Research Chair with the School of Information Technology and Engineering. Since 2005, he has been the Chief Scientific Officer with XYZ RGB Inc., Ottawa, which is a 3-D imaging company. He has published more than 280 technical papers, authored two books, and edited two other books. He is the holder of two patents. His research interests include robot sensing and perception, interactive virtual environments, human–computer symbiosis, soft computing, and digital integrated circuit testing.

Dr. Petriu is a Fellow of the Canadian Academy of Engineering and of the Engineering Institute of Canada. He is an Associate Editor of the IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT and a member of the Editorial Board of the IEEE Instrumentation and Measurement Magazine. He was a corecipient of the IEEE's Donald G. Fink Prize Paper Award and the recipient of the IEEE Instrumentation and Measurement Society Award in 2003. He is currently serving as the Chair of TC-15 Virtual Systems and the Cochair of TC-28 Instrumentation and Measurement for Robotics and Automation and TC-30 Security and Contraband Detection of the IEEE Instrumentation and Measurement Society. He was the General Chair of the IEEE Instrumentation and Measurement Technology Conference (IMTC), May 2005, Ottawa. He was the Program Chair of the IMTC 1997, Ottawa; the IMTC 1998, St. Paul, MN; the IMTC 2002, Anchorage, AK; the International Instrumentation and Measurement Technology Conference 2008, Victoria, BC, Canada; the IEEE International Workshop on Measurement Systems for Homeland Security, Contraband Detection, and Personal Safety (IMS), March 2005, Orlando, FL; the IMS 2006, Alexandria, VA; and the IEEE International Conference on Virtual Environments, Human–Computer Interfaces, and Measurement Systems (VECIMS), June 2007, Ostuni, Italy. He is the General Chair of the VECIMS, July 2008, Istanbul, Turkey.