
Towards Multilevel Human Body Modeling and Tracking in 3D: Investigation in Laplacian Eigenspace (LE) Based Initialization and Kinematically Constrained Gaussian Mixture Modeling (KC-GMM)

    (CSE Department Research Exam Report – August 14, 2008)

Cuong Tran (1)
Advisor: Prof. Mohan Trivedi

    Computer Vision and Robotics Research Laboratory

    University of California, San Diego

(1) Email: [email protected]

ABSTRACT

Vision-based automatic human body pose estimation has many potential applications, and it is also a challenging task. Together, these two factors have made it an attractive research area, with closely related subproblems including body pose, hand pose, and head pose estimation. To date, however, research has dealt with body pose, hand pose, and head pose estimation as separate tasks. In this paper, we raise the issue of multilevel human body pose estimation and focus on model-based methods for articulated human body pose estimation using volumetric (voxel) data. We describe the important steps in such methods and analyze and compare several recent techniques. Based on this analysis, we propose a fairly general method that combines the properties of the Laplacian Eigenspace (LE) transformation discovered in [23] with the KC-GMM method [3] for automatic initialization and tracking of both body and hand models from voxel data. We also propose a framework for estimating human body pose at multiple levels (i.e. body pose, hand pose, head pose) in an integrated way. The proposed method and framework are presented along with experimental support, and other possible avenues for future work in the area are discussed.

Keywords: Vision-based, markerless, multilevel human body pose estimation, hand pose estimation, volumetric reconstruction

1. INTRODUCTION

Vision-based pose estimation and tracking of the articulated human body is the problem of estimating the kinematic parameters of the body model (such as joint positions and joint angles) from static images or video sequences as the body's position and configuration change over time. To clarify the terms body pose estimation, hand pose estimation, and head pose estimation: body pose refers to the articulated body model with torso, head, and four limbs, but without hands, feet, or detailed head and facial variation; hand pose refers to the articulated multi-DOF (degree of freedom) hand model; and head pose refers to the articulated head and face model.

A good body/head/hand pose estimation system has many potential applications, including advanced human-computer interaction (HCI), surveillance, 3D animation, intelligent environments, robot control, etc. Compared to previous technologies using markers or specialized devices, markerless vision-based approaches provide more natural, non-contact solutions. This is, however, a very challenging task. One major reason is the very high dimensionality of the pose configuration space; e.g. in [3], 19 DOF (degrees of freedom) are used for the body model and 27 DOF for the hand model. Moreover, we also have to deal with other common issues in computer vision such as self-occlusion, variation in lighting conditions, shadows, and variation in object appearance (e.g. different clothing, hair).

There have been numerous research studies in this area, with over 350 publications in the period from 2000 to 2006 alone [15]. Surveys of techniques for articulated pose estimation can be found in [14, 15, 20, 25] for the body, [8] for the hand, and [16] for the head. These works, however, each deal with estimating body pose, hand pose, or head pose in isolation, while the ability to estimate a multilevel full-body model (i.e. a model consisting of body, hands, and head) is desirable for two reasons. First, the combined multilevel information is useful: in an intelligent environment, for example, the combination of body pose, hand pose, and head pose gives better interpretations of human status and intention. Second, information from different levels can support each

other and help to improve estimation performance. For this reason, we propose a framework for performing human body pose estimation at multiple levels in an integrated way, so that the results from each level can be combined into a full model of body, hands, and head.

This paper focuses on the subarea of model-based methods for real 3D human body pose estimation using reconstructed voxel data. The remainder of the paper is organized as follows. In Section 2, we put this subarea in the broader context of vision-based human body pose estimation. Section 3 is a literature review of model-based methods for human body pose estimation using voxel data. Based on this review, Sections 4, 5, and 6 describe the proposed combined method (for automatic initialization and tracking of body and hand models from voxel data), the proposed integrated framework (for performing human body pose estimation at multiple levels), and our related experimental results. Finally, we offer concluding remarks along with suggestions for possible avenues of future work in Section 7.

2. MODEL-BASED METHODS USING VOXEL DATA IN A LARGER CONTEXT OF HUMAN BODY POSE ESTIMATION

There have been several surveys of vision-based human body pose estimation [14, 15, 20, 25, 8], each with a different focus and taxonomy. Werghi [25] provided a general overview of both 3D body scanner technologies and approaches for dealing with such scanned data, focusing on one or more of the following topics: body landmark detection, segmentation of body scan data, body modeling, and body tracking. In Poppe's survey [20] of body pose estimation techniques, he distinguished 2D approaches from 3D approaches, depending on whether the goal is a 2D or 3D pose representation, and model-based approaches from model-free approaches, depending on whether an a priori kinematic body model is employed. Poppe's survey also split the pose estimation process into a modeling process, which is the construction of the likelihood function, and an estimation process, which is concerned with finding the optimal pose given the likelihood. Moeslund et al. [14] split the body pose estimation process into initialization, tracking, pose estimation, and recognition; in [15], they also provided an updated review of advances in human motion capture for the period from 2000 to 2006. In [8], Erol et al. surveyed techniques for vision-based hand pose estimation. They distinguished partial pose estimation, which only captures the motion of some specific parts (e.g. fingertips), from full-DOF pose estimation. They also categorized approaches into model-based tracking approaches, which make use of motion history and dynamics from a sequence of frames, and single-frame approaches, which make no assumption of temporal coherence.

Figure 1. Block diagram of a generic human body pose estimation system. Dashed lines indicate that the underlying kinematic model may or may not be used. Gray boxes show the focus of this paper: model-based methods that use voxel data and aim to extract full 3D posture.

We see that it is not easy to establish a unified taxonomy of the broad area of human body modeling and tracking. In Figure 1, we describe the block diagram of a generic vision-based human body pose estimation system, in which we first need components to extract useful features from the input vision data and then a procedure to infer body pose from the extracted features. We can loosely categorize related research studies into monocular [10, 12, 17, 21] and multi-view approaches [1, 2, 3, 4, 5, 6, 11, 13, 18, 19, 23, 24]. Compared to monocular approaches, multi-view data can reduce the self-occlusion issue and provide more information, making the pose estimation task easier and improving accuracy. Among multi-view approaches, some methods use 3D features reconstructed from multiple views [1, 2, 3, 4, 5, 6, 13, 18, 23, 24], e.g. volumetric (voxel) data, while others still use 2D image features [11, 19], e.g. color, edges, silhouettes. Because the real body pose is in 3D, using voxel data avoids the repeated projection of the 3D body model onto the image planes for comparison against extracted 2D features. Furthermore, reconstructed voxel data avoid the image scale issue. These advantages allow the design of simpler algorithms and let us make use of our knowledge about the shapes and sizes of body parts. Although other image features like color and texture are commonly not taken into account in voxel-based approaches, several methods of this kind indicate that voxel data alone can be sufficient for human body pose estimation. To reconstruct voxel data, we of course pay an additional computational cost; fortunately, efficient techniques for this task have been developed [1, 4, 5, 22].

Another input used in many methods is a predefined kinematic model of the human body. These methods are called model-based methods: there is an underlying kinematic model and a procedure to fit that model to the real input data. There are also model-free methods, which assume no underlying kinematic model and contain


procedures to learn a direct mapping from feature space to pose configuration space. Although information from an underlying kinematic body model can help to improve accuracy and robustness, the advantage of model-free methods is that they do not suffer from the (re)initialization issue and can be used to initialize model-based methods.

Regarding the pose estimation output, two research directions have emerged. One aims only to extract high-level abstract information corresponding to the motion and posture of the body, which can then be applied, for example, to gesture classification. The other aims to recover the real (full) 3D motion and posture of the human body. The latter is more challenging but worth pursuing, because it provides more general, principled methods that can be adapted to extract different high-level abstract information depending on the application area. Moreover, various types of interaction styles and applications explicitly rely on the full 3D pose information.

Figure 2. Flowchart of common model-based methods for articulated human body pose estimation using voxel data. Dashed boxes indicate steps that some methods may not have. The initialization/segmentation may be called at each frame, or only when we need to initialize or re-initialize the body model during tracking. A summary of the steps contained in selected methods is shown at the bottom.

In the following section, we review model-based methods for real 3D human body pose estimation using reconstructed voxel data. Based on this review, we propose a combined method that makes use of voxel segmentation in LE [23] and the KC-GMM method [3] for automatic initialization and tracking of both body and hand models from voxel data.


3. LITERATURE REVIEW OF MODEL-BASED METHODS FOR HUMAN BODY POSE ESTIMATION USING VOXEL DATA

In this kind of method, the input is the voxel data of the human body, V, reconstructed from multiple camera views. Normally, V is a binary 3D matrix representing a predefined space of interest where we expect to see the human body; a voxel in V with value 1 belongs to the concerned subject. The output is the estimated 3D pose configuration of the human body, X, which contains information about the absolute position ($\mu$) and orientation (rotation matrix $R$) of each body component, $X = \{(\mu_i, R_i)\}_{i=1:n}$. A more convenient way to represent X is the twist framework (see Section 3.1), which uses a global position and orientation (e.g. the position and orientation of the torso) and a sequence of relative angles between body parts in a hierarchy (e.g. torso -> upper arm -> lower arm):

$X = \{\mu_1, R_1, \theta_{2,parent(2)}, \ldots, \theta_{n,parent(n)}\}$

For the data capture and voxel reconstruction steps, several human body scanner technologies are mentioned in [25]; here we are concerned with vision-based techniques that reconstruct voxel data of the body from multiple-perspective cameras. A common approach is shape-from-silhouette (visual hull): first, the images from multiple synchronized cameras are segmented into object silhouettes using background subtraction techniques [7, 9]; then efficient shape-from-silhouette techniques [4, 5, 22] are used to retrieve the 3D voxel data. There is also an approach called shape-from-photo-consistency (photo hull) that uses other features of the photos (not just the silhouette) to obtain a more accurate geometry of the reconstructed hull. For body pose estimation, this more accurate geometry is not necessary, so the visual hull is more appropriate (it should be faster and more robust to noise). Typically, voxel data reconstruction and pose estimation for body, hand, or head are done separately. This issue is discussed further in Section 5, followed by a proposed framework for performing human body pose estimation at multiple levels in an integrated way; the result from each level (body pose, hand pose, head pose) can then be combined into a full body model.
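The visual hull idea lends itself to a compact implementation. The following is a minimal sketch of that idea, not the optimized techniques of [4, 5, 22]: a voxel is kept only if it projects inside the foreground silhouette in every view. The function and parameter names are our own assumptions; it assumes calibrated 3x4 projection matrices and binary silhouette images.

```python
import numpy as np

def reconstruct_visual_hull(silhouettes, projections, grid_points):
    """Minimal shape-from-silhouette sketch: keep a voxel only if it projects
    inside the foreground silhouette of every camera.

    silhouettes: list of binary HxW arrays (1 = foreground)
    projections: list of 3x4 camera projection matrices
    grid_points: Nx3 array of voxel center coordinates (world frame)
    """
    n = grid_points.shape[0]
    occupied = np.ones(n, dtype=bool)
    homog = np.hstack([grid_points, np.ones((n, 1))])  # Nx4 homogeneous coords
    for sil, P in zip(silhouettes, projections):
        uvw = homog @ P.T                      # project all voxels at once
        valid = uvw[:, 2] > 0                  # voxel in front of the camera
        u = np.zeros(n, dtype=int)
        v = np.zeros(n, dtype=int)
        u[valid] = (uvw[valid, 0] / uvw[valid, 2]).astype(int)  # pixel column
        v[valid] = (uvw[valid, 1] / uvw[valid, 2]).astype(int)  # pixel row
        h, w = sil.shape
        inside = valid & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        fg = np.zeros(n, dtype=bool)
        fg[inside] = sil[v[inside], u[inside]] > 0
        occupied &= fg                         # must be foreground in all views
    return occupied
```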

Figure 2 shows a typical flowchart of common model-based pose estimation methods using voxel data. There are five main steps: camera calibration/data capture, voxel reconstruction, initialization/segmentation (segmenting voxel data into different body parts), modeling/estimation (estimating pose using the current frame only), and tracking (using temporal information from previous frames when estimating the body pose in the current frame). The first two steps are common to all methods of this kind, while different methods may implement different combinations of the last three. A summary of which steps are contained in selected methods is shown at the bottom of Figure 2.


Mikic '03 [13]
  Initialization: grows the model over the data from head to torso to limbs.
  Model: ellipsoidal and cylindrical components, described with twists.
  Estimation: tracking-based; uses an Extended Kalman Filter to predict the next pose, then updates using the growing procedure and Bayesian networks.
  Evaluation: visual only.
  Comment: fully automated, but lacks generality (no hands).

Cheung '03 [5]
  Initialization: CSP alignment and segmentation via motion clustering.
  Model: skeletal body model.
  Estimation: uses Colored Surface Points (CSP); hierarchical segmentation/SFS alignment to recover motion, shape, and joints.
  Evaluation: synthesized ground truth.
  Comment: fully automated, but no hands.

Sundaresan '07 [23]
  Initialization: voxel segmentation in LE and a procedure for body part registration.
  Model: 6-chain representation of the body; superquadric components in the body model.
  Estimation: segments voxel data in Laplacian Eigenspace (LE); probabilistically registers segmented voxels to body parts, then estimates skeletal and superquadric parameters.
  Evaluation: synthesized ground truth.
  Comment: fairly general and fully automated, but sensitive to noise in the voxel data.

Caillette '08 [1]
  Initialization: a priori known skeletal model, initialized with a K-means blob fitting procedure.
  Model: skeletal body model and Gaussian blobs.
  Estimation: tracking-based; breaks complex movements into basic motions, uses a Variable Length Markov Model (VLMM) to predict candidate poses, evaluates them with blob fitting, and uses colored voxels for more robust tracking.
  Evaluation: manually annotated ground truth.
  Comment: fast (real-time), but limited to trained movement sequences.

Table 1. Summary of selected model-based methods using voxel data that have been applied to body modeling only

Ogawara '03 [18]
  Initialization: a priori known model and initial pose.
  Model: surface and skeletal hand model.
  Estimation: uses ICP and M-estimators to fit the hand model in 3D space (3DTM algorithm).
  Evaluation: synthesized ground truth.
  Comment: little evaluation; fixed model.

Ueda '03 [24]
  Initialization: a priori known model.
  Model: surface and skeletal hand model.
  Estimation: torque forces generated from differences between model surface points and real voxel data (ICP).
  Evaluation: synthesized ground truth.
  Comment: little evaluation; fixed model.

Table 2. Summary of selected model-based methods using voxel data that have been applied to hand modeling only

Delamarre '01 [6]
  Initialization: a priori known model with manual initial placement.
  Model: 3D primitive-shape model (truncated cones, spheres, and parallelepipeds).
  Estimation: tracking-based (Kalman Filter); uses physical forces, a simpler form of Iterative Closest Point (ICP), to make the 3D model and the reconstructed data intersect.
  Evaluation: visual only.
  Comment: not really a method for estimating pose with voxel data (it uses only two-view epipolar geometry reconstruction), so results are limited by possible ambiguities.

Cheng '07 [3]
  Initialization: manual initialization of body part dimensions and initial pose.
  Model: ellipsoidal components, described by a Kinematically Constrained Gaussian Mixture Model (KC-GMM).
  Estimation: integrates kinematic constraints into the KC-GMM model; derives an EM algorithm with KC-GMM for pose estimation (no additional projection step).
  Evaluation: synthesized and marker-based motion capture ground truth.
  Comment: fairly general, applied to both body and hand, but requires manual initialization.

Table 3. Summary of selected model-based methods using voxel data that have been applied to both body and hand modeling. The method in [6] is, however, not really a method for estimating pose in volumetric space, and it has little evaluation


The modeling and tracking steps can be considered as a mapping from the input space of observed voxel data V, together with the information C in the predefined model (e.g. kinematic constraints), to the body model configuration space X:

$M : (V, C) \mapsto X$

The body model configuration contains both static

parameters (i.e. the shape and size of each body component) and dynamic parameters (i.e. the position and orientation of each body component); the static parameters are estimated in the initialization step, using information in the predefined model. Some methods, like [13, 23], have an automatic initialization step, while others require a priori known or manually initialized static parameters. The main differences between methods of this kind lie in the body model they use and in how they implement the mapping procedure M. Methods that have a modeling step but no tracking step are also called single-frame-based methods, while methods with a tracking step are called tracking-based methods. Because the tracker in tracking-based methods can get lost over long sequences, multiple hypotheses at each frame can be used to improve the robustness of tracking. The single-frame-based approach is a more difficult problem because it makes no assumptions about temporal coherence; however, this kind of approach is needed for the initialization or re-initialization of tracking-based methods. Regarding generality, some methods can be applied to both body and hand models, while others are specific to only one type of model. Regarding evaluation, some methods offer only visual evaluation, while others also provide quantitative evaluation against ground-truth data obtained from synthesized data, manual annotation, or a marker-based motion capture system.

According to the factors mentioned above, a summary of several recent model-based methods for human body pose estimation using volumetric data is shown in Tables 1-3. Table 1 covers several noticeable recent methods that have been applied to modeling and tracking of the body only, which is the most intensively studied subfield. Table 2 covers methods that have been applied to modeling and tracking of the hand only. There are not many methods of this kind; one reason is that it is hard to obtain good voxel data for the hand, because the concave shape of the hand leads to severe occlusion. It should be mentioned that hand pose estimation is closely related to body pose estimation; e.g. the methods in [18, 24] share with the method in [6] the idea of using physical forces generated with Iterative Closest Point (ICP). Table 3 covers methods that have been applied to both body and hand modeling. There are very few methods of this kind, and of the two listed, the method in [6] does not really estimate pose in volumetric space: it uses only two-view epipolar geometry reconstruction, which limits the estimation result due to several possible ambiguities. In the following sections, we discuss some of these methods in more detail to emphasize the important results and limitations of each.

3.1. Method using hierarchical pose estimation and Extended Kalman Filter pose prediction for body modeling and tracking [13]

This method was highly rated in [25]. It is a tracking-based, fully automated method, and it provides good visual results for body pose estimation on several complex motion sequences. The body model used in this method is shown in Figure 3.(a), where the torso is represented by a cylinder and the other body parts are represented by ellipsoids. Prominent features of this method are the use of the twist framework (Murray et al., 1993) for describing kinematic chains and a hierarchical pose estimation procedure.

    In twist framework, the location of a point on the body is represented by its initial location and a chain of hierarchical relative movements that affect the position of that point, e.g. for a point in the lower arm, the chain of movements includes the movement of the torso, the relative movement between the upper arm and the torso, and the relative movement between the lower arm and the upper arm. There are several advantages of using this framework. First, it results in a non-redundant set of model parameters. Second, the relationship between the model parameters and the locations of specific points is simple. And finally, most of the constraints that ensure physically valid body configurations are now inherent to the model.

This method does both body model acquisition (initialization) and tracking automatically. The model acquisition step is done by a hierarchical growing procedure, starting by locating the head using its specific shape

    5

and size. They use a crust whose inner and outer diameters correspond to the minimum and maximum size of a head to search for the head. The head is found when the inner sphere is filled with voxels and the number of surface voxels in the outer shell is maximal. After locating the head position, the procedure grows to locate and initialize the torso, and then the limbs, from the remaining voxel data. This growing procedure follows the divide-and-conquer principle to reduce the search space. The initial estimate from the growing procedure is refined using a Bayesian network to take into account general knowledge of human body proportions.
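To make the head search concrete, here is a rough sketch of the spherical-crust test as we read it from [13]. The names, the candidate-set interface, and the 90% fill tolerance are our own assumptions, not the exact implementation.

```python
import numpy as np

def find_head_center(voxels, surface_voxels, r_in, r_out, candidates):
    """Sketch of the spherical-crust head search (our reading of [13]).

    voxels: Nx3 array of all body voxel centers
    surface_voxels: Mx3 array of surface voxel centers
    r_in, r_out: minimum/maximum head radius (the "crust")
    candidates: Kx3 array of candidate head centers to test
    Returns the candidate whose inner sphere is filled with voxels and
    whose outer shell contains the most surface voxels.
    """
    best_center, best_score = None, -1
    inner_volume = (4.0 / 3.0) * np.pi * r_in ** 3  # assumes unit voxel size
    for c in candidates:
        d_all = np.linalg.norm(voxels - c, axis=1)
        # Require the inner sphere to be (nearly) filled with body voxels;
        # the 0.9 tolerance is our guess, not a value from the paper.
        if np.count_nonzero(d_all <= r_in) < 0.9 * inner_volume:
            continue
        d_surf = np.linalg.norm(surface_voxels - c, axis=1)
        score = np.count_nonzero((d_surf > r_in) & (d_surf <= r_out))
        if score > best_score:
            best_center, best_score = c, score
    return best_center
```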

For tracking, an Extended Kalman Filter (EKF) is used to predict the body configuration in the next frame from the results in the previous frame (a tracking-based method). The state transition and measurement equations are:

$x[k+1] = F[k] x[k] + v[k]$

$z[k] = h[x[k], k] + w[k]$

in which the state x is the pose configuration in the twist framework representation, $x(t) = [t_0^T, \omega_0^T, \theta_0, \ldots, \theta_{16}]^T$. The measurement z includes the centroids and endpoints of the different body parts (23 points in total). The noises v[k] and w[k] are zero-mean Gaussians. The function h[x] is the chain of relative rotations that determines the measurement points from a given pose configuration x. The state transition matrix F is set to the identity matrix; although this is a "poor" transition model, we can still obtain acceptable results if the measurements are good. Using the EKF algorithm, we compute the posterior estimate of x given the measurement z and the prior estimate of x (computed from the state transition equation). This predicted pose configuration state x is then updated using a growing procedure similar to the one described above.
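For readers unfamiliar with the EKF machinery, the following minimal sketch shows one predict/update cycle with F set to the identity, as in [13]. The measurement function h and its Jacobian are passed in as callables; all names are our own assumptions.

```python
import numpy as np

def ekf_step(x, P, z, h, H_jac, Q, R):
    """One EKF cycle with F = I (a random-walk transition model).

    x, P : previous state estimate and covariance
    z    : measurement (stacked centroids/endpoints of body parts)
    h    : function mapping a pose state to predicted measurement points
    H_jac: function returning the Jacobian of h at a given state
    Q, R : process and measurement noise covariances
    """
    # Predict: with F = I the predicted state is just the previous state.
    x_pred = x
    P_pred = P + Q
    # Update: linearize h around the predicted state.
    H = H_jac(x_pred)
    S = H @ P_pred @ H.T + R               # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)    # Kalman gain
    x_new = x_pred + K @ (z - h(x_pred))   # posterior state estimate
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```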

The advantage of this method is that it is fully automated, and with the growing procedure it can track even large displacements. However, because it uses specific shape and size information about the head and torso, it lacks the generality to be extended to other articulated structures such as the hand.

3.2. Method using a Kinematically Constrained Gaussian Mixture Model (KC-GMM) for both body and hand pose estimation [3]

This is one of the very few methods that have been applied (with experimental results) to both body models and hand models. Among the methods competing in the Workshop on Evaluation of Articulated Human Motion and Pose Estimation (CVPR EHuM2 2007, including [3, 19]), this method won first prize.

The hand model and body model used in this method are shown in Figure 3.(b). For the hand, there are 16 components with 27 DOF (degrees of freedom). For the body, there are 11 components with 19 DOF. The pose estimation

procedure of this method uses the paradigm of probabilistic clustering. Each body component is described by a Gaussian $(\mu_i, \Sigma_i)$, where the mean $\mu_i$ is the position and the covariance matrix $\Sigma_i = R_i \Lambda_i R_i^T$ contains information about the dimension and orientation. The set of such Gaussian components is kinematically constrained according to the predefined model; the goal is then to estimate optimal values for the Gaussian Mixture Model (GMM) under those kinematic constraints. The kinematic constraints are represented by three equations corresponding to three types of constraint: spherical (3 DOF), Hardy-Spicer (2 DOF), and revolute (1 DOF):

$c_s(\Theta) = (\mu_i + R_{0i} a_{ij}) - (\mu_j + R_{0j} a_{ji})$

$c_h(\Theta) = (R_{0i} q_{ij}) \cdot (R_{0j} q_{ji})$

$c_r(\Theta) = R_{0i} q_{ij} - R_{0j} q_{ji}$

where $\Theta$ is the embodiment of the kinematic constraints and all configuration parameters, $\mu_i, \mu_j$ are the means of components i and j, $R_{0i}, R_{0j}$ are the rotations of the components relative to the world coordinates, $a_{ij}, a_{ji}$ are the joint positions in each component's own coordinate frame (with the origin at the component center), and $q_{ij}, q_{ji}$ are the rotation axes of each component in its own coordinate frame. We can interpret these equations as follows: $c_s = 0$ means the two joints on the two components coincide, giving a 3 DOF constraint; $c_h = 0$ means the two rotation axes are perpendicular, which combined with $c_s = 0$ gives a 2 DOF constraint; $c_r = 0$ means the two rotation axes are aligned, which combined with $c_s = 0$ gives a 1 DOF constraint.
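As a concrete illustration, the sketch below evaluates the three constraint residuals for a pair of connected components. It follows our reading of the equations above (in particular, we wrote the Hardy-Spicer constraint as a dot product so that c_h = 0 means perpendicular axes); it is not code from [3].

```python
import numpy as np

def constraint_residuals(mu_i, R_i, a_ij, q_ij, mu_j, R_j, a_ji, q_ji):
    """Residuals of the three KC-GMM joint constraints (our reading of [3]).

    mu, R : component center and rotation relative to world coordinates
    a     : joint position in the component's own coordinate frame
    q     : rotation axis in the component's own coordinate frame
    All residuals should be (near) zero for a valid configuration.
    """
    # Spherical: the shared joint must coincide in world coordinates (3 DOF).
    c_s = (mu_i + R_i @ a_ij) - (mu_j + R_j @ a_ji)
    # Hardy-Spicer: the two rotation axes must be perpendicular (2 DOF
    # together with c_s = 0).
    c_h = np.dot(R_i @ q_ij, R_j @ q_ji)
    # Revolute: the two rotation axes must be aligned (1 DOF together
    # with c_s = 0).
    c_r = (R_i @ q_ij) - (R_j @ q_ji)
    return c_s, c_h, c_r
```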

In previous work by the same authors [2], these constraint equations were satisfied by adding a constraining step (C-step) to the EM algorithm for Gaussian mixture estimation. However, this additional C-step may compete with the M-step and cause instability in the optimization. The primary contribution of [3] is the removal of this C-step by incorporating the kinematic constraints into the probability model in the form of a prior probability, yielding the Kinematically Constrained Gaussian Mixture Model (KC-GMM):

$P(Y, c \mid \Theta) = P(c \mid \Theta) \prod_n P(y_n \mid c, \Theta) = P(c \mid \Theta) \prod_n \left[ \sum_{z_n} P(y_n \mid z_n, \Theta) P(z_n) \right]$

where $Y = \{y_n\}$ represents the distribution of the input voxel data (generated by a mixture of Gaussians) and $z_n$ is the hidden membership variable. The EM algorithm for this new probability model is then derived for maximum-likelihood estimation of the Gaussian component parameters, which can be interpreted as the body configuration.
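To make the role of the constraint prior concrete, the following sketch evaluates the log of the objective above for a given configuration. It is our own illustration, not the authors' implementation; in particular, log_prior_c stands in for whatever form P(c|Θ) takes in [3].

```python
import numpy as np
from scipy.stats import multivariate_normal

def constrained_mixture_loglik(Y, weights, means, covs, log_prior_c):
    """log P(Y, c | Theta) as we read it: the kinematic-constraint prior
    term plus the usual Gaussian mixture log-likelihood.

    Y          : Nx3 voxel coordinates
    weights    : mixing proportions P(z_n) for each component
    means, covs: Gaussian parameters of each body component
    log_prior_c: log P(c | Theta), the constraint prior (assumed given)
    """
    n_comp = len(weights)
    # log P(y_n | z_n = k, Theta) for every voxel and every component
    log_pdf = np.column_stack([
        multivariate_normal.logpdf(Y, means[k], covs[k]) for k in range(n_comp)
    ])
    log_mix = log_pdf + np.log(weights)   # add log P(z_n)
    # log-sum-exp over components, summed over all voxels
    m = log_mix.max(axis=1, keepdims=True)
    loglik = np.sum(m.squeeze() + np.log(np.exp(log_mix - m).sum(axis=1)))
    return log_prior_c + loglik
```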


This method is quite general and has been applied successfully to both body and hand. However, it is not fully automated, because it requires a careful manual initialization step; this is obviously an obstacle if we want to use the method in real-time applications. Another issue with the KC-GMM method is that, due to the nature of the EM algorithm, it can get stuck in a sub-optimal solution, especially when there is a large displacement.

3.3. Method using a Monte Carlo Bayesian framework with a Variable Length Markov Model (VLMM) pose prediction scheme followed by a Gaussian blob fitting procedure for body modeling and tracking [1]

Due to the complex, high-dimensional model of the articulated body, the running speed of pose estimation algorithms is a real issue. Real-time performance is a prominent goal of this method, and it is one of the very few methods that report run-time performance.

The body model used in this method is shown in Figure 3.(b); it consists of a skeletal model and Gaussian blobs attached to the bones of the skeleton. In this method, color information is also kept with each voxel, which allows more robust tracking.

This method does body tracking based on a Monte Carlo Bayesian framework. The posterior distribution P(X|Z), where X is the 3D pose configuration and Z is the available measurement from the observed voxel data, is estimated by generating a set of samples (particles) in pose configuration space, propagating the particles using a motion prior (prediction), weighting the particles by likelihood evaluation, and then finding the optimal pose configuration based on the weighting function. This kind of particle filter approach searches globally, so there is no local sub-optimum issue as in the KC-GMM method. However, because of the very high dimensionality of the body pose configuration space, the number of particles needed to accurately represent the true posterior P(X|Z) can be very large, which leads to a computational issue. In [1], they address this pitfall by using Sample Importance Resampling (SIR) to generate particles, a VLMM prediction scheme to propagate particles, and a Gaussian blob fitting procedure for faster likelihood evaluation.
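The following is a generic sketch of one SIR cycle to make the predict-weight-resample loop concrete; the propagate and likelihood callables stand in for the VLMM prediction and blob-fitting steps of [1], and all names are our own assumptions rather than that paper's exact pipeline.

```python
import numpy as np

def sir_step(particles, weights, propagate, likelihood, rng):
    """One Sample Importance Resampling step (generic particle filter sketch).

    particles : SxD array of pose hypotheses
    weights   : length-S importance weights (summing to 1)
    propagate : function applying the motion prior to a particle set
    likelihood: function scoring particles against the observed voxels
    rng       : np.random.Generator
    """
    S = len(weights)
    # Resample according to the current weights (importance resampling).
    idx = rng.choice(S, size=S, p=weights)
    particles = particles[idx]
    # Propagate with the motion prior ([1] uses VLMM prediction here).
    particles = propagate(particles)
    # Re-weight by the likelihood ([1] uses Gaussian blob fitting here).
    w = likelihood(particles)
    weights = w / w.sum()
    # The pose estimate can be the best particle (or a weighted mean).
    estimate = particles[np.argmax(weights)]
    return particles, weights, estimate
```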

For more accurate and efficient hypothesis propagation of particles, complex human activities such as dancing are broken into elementary movements using Gaussian clustering in a feature space. For each body pose, the corresponding feature vector $F_t = (x_t, x_t')$ consists of the joint angle vector $x_t$ (pose configuration) and its first derivative $x_t'$. The global position and orientation are dropped because we do not want the learnt behaviour model to be sensitive to them; the first derivative is included to help resolve ambiguities in pose configuration space. Human body behaviour then corresponds to a smooth trajectory in the feature space, and by clustering this feature space we can break complex human activities into elementary movements for which local dynamic models are easier to infer.

Besides the local dynamics within a cluster, complex human activities also require transitions between clusters, and these transitions are predicted using a VLMM. The issue with a fixed d-order Markov model is that, while a low d often gives poor results, a higher order leads to computational issues and can be impractical (the memory and training-set size requirements grow exponentially with d). With a VLMM, we can fit a higher-order Markov model in the contexts where it is needed, while using a lower order elsewhere; such contexts are determined by the training data. For the problem of predicting transitions between motion Gaussian clusters, a VLMM can be thought of as a probabilistic finite-state automaton $M = (Q, K, \tau, \gamma, s)$, where K is the finite alphabet of the VLMM (Gaussian cluster labels are used as the alphabet); Q is a finite set of model states (the memory for conditional transitions of the VLMM), which are strings over K of length at most $N_M$ ($N_M > 0$); and the transition function $\tau$, the output probability function $\gamma$ for a particular state, and the probability distribution s over the start states are defined as

$\tau: Q \times K \to Q$, $\gamma: Q \times K \to [0, 1]$, $s: Q \to [0, 1]$

A sequence of human body behaviour is represented as a sequence of Gaussian cluster labels and can be used to train the VLMM. After training, traversing M generates statistically plausible examples of the behaviour. Particle propagation is then done using both the VLMM cluster transitions and the local dynamics within a cluster (by sampling a local noise vector).
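As an illustration of the variable-length idea (backing off to the longest context seen in training), here is a toy model over cluster labels. It is deliberately simplified and does not implement the pruning and state-selection criteria used to build a real VLMM.

```python
from collections import defaultdict, Counter

class VLMMSketch:
    """Toy variable-length Markov model over cluster labels (illustrative).
    States are suffixes of the label sequence up to length max_order;
    prediction backs off to the longest suffix seen in training."""

    def __init__(self, max_order=3):
        self.max_order = max_order
        self.counts = defaultdict(Counter)  # context tuple -> next-label counts

    def train(self, sequence):
        for t in range(1, len(sequence)):
            # Record the next label under every context length up to max_order.
            for k in range(0, min(self.max_order, t) + 1):
                context = tuple(sequence[t - k:t])
                self.counts[context][sequence[t]] += 1

    def predict(self, history):
        # Back off from the longest context to shorter ones (variable length).
        for k in range(min(self.max_order, len(history)), -1, -1):
            context = tuple(history[len(history) - k:])
            if context in self.counts:
                c = self.counts[context]
                total = sum(c.values())
                return {label: n / total for label, n in c.items()}
        return {}

# Example usage with hypothetical elementary-motion cluster labels.
m = VLMMSketch(max_order=2)
m.train(list("aabbaabbaacc"))
print(m.predict(list("aa")))  # distribution over the next cluster label
```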

The evaluation of the likelihood can be done faster with the proposed Gaussian blob fitting procedure on reconstructed voxel data: by reconstructing voxel data, we avoid the repeated projection between the 3D appearance model and the image planes for comparison, and the Gaussian blobs are a more compact representation for comparing a candidate pose configuration with the current observation. This blob-fitting procedure can also detect tracking failures, e.g. when the best achieved likelihood is below a threshold. A re-initialization can then be requested by performing blob fitting from all clusters, which however may introduce a considerable lag.

The experimental results of this method indicate an improvement over some other standard particle-filter-based algorithms [1]. The runtime performance was also reported, with the total time for both voxel reconstruction and body pose inference being around 50-110 milliseconds, depending on the configuration parameters. However, the performance of this method depends largely on the correctness of the prediction, which means that it requires a good training phase and can only work well with the specific types of movements it was trained on. The current implementation does not handle new movements that are unseen in the training data.


Figure 3. Some body/hand models used in the analyzed methods

    Figure 4. Steps of the combined method for automatic body/hand model initialization and tracking

Figure 5. (a) A simple procedure for body/hand model initialization; (b)(c)(d) body/hand model initialization results


3.4. Laplacian Eigenspace (LE) based method for body modeling [23]

This is a skeletonization method that obtains the skeletons of the individual articulated body chains. The voxel segmentation technique in this method is quite general and can handle poses with self contact, i.e. when one or more limbs touch other body parts. In this method, the voxel data is first segmented into 6 chains representing the body (torso, head, and 4 limbs). Based on this segmentation result, more detailed skeletal and superquadric models representing the body are estimated. These representations of the body (6-chain, skeletal, and superquadric) are shown in Figure 3.(c).

A main contribution of this method is the discovery of the properties of the Laplacian Eigenspace (LE) transformation after a comparison of several manifold techniques (Laplacian, Isomap, MDS, ...). When mapped into a high-dimensional (e.g. 6D) LE, the voxel data of body chains such as limbs, whose length is greater than their thickness, form smooth 1-D curves, which can then be used to segment the voxel data into different body chains. The procedure for the LE mapping is as follows. First, we compute the adjacency matrix W of the voxel data, such that $W_{ij} = 1$ only if voxel i is a neighbor of voxel j. Then we compute the diagonal matrix D, with $D_{ii} = \sum_{k=1}^{m} W_{ik}$ and $D_{ij} = 0$ for $i \neq j$. The first d eigenvectors of $L = D - W$ with minimum eigenvalues give us the d basis vectors of the needed LE. After mapping into the LE, there are several 1-D curves corresponding to the different body chains. A spline fitting process is used to segment the curves, which results in the segmentation of their respective body chains.
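The LE mapping itself reduces to a few lines with a sparse eigensolver. The sketch below is our own minimal version, assuming 26-connected integer voxel coordinates; it is not the implementation of [23].

```python
import numpy as np
from scipy.sparse import coo_matrix, diags
from scipy.sparse.linalg import eigsh

def laplacian_eigenspace(voxels, d=6):
    """Map voxels to a d-dimensional Laplacian Eigenspace (sketch of [23]).

    voxels: Nx3 integer voxel coordinates; 26-connected voxels are adjacent.
    Returns an Nxd embedding from the eigenvectors of L = D - W with the
    smallest non-trivial eigenvalues.
    """
    index = {tuple(v): i for i, v in enumerate(voxels)}
    offsets = [(dx, dy, dz) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
               for dz in (-1, 0, 1) if (dx, dy, dz) != (0, 0, 0)]
    rows, cols = [], []
    for i, v in enumerate(voxels):
        for dx, dy, dz in offsets:
            j = index.get((v[0] + dx, v[1] + dy, v[2] + dz))
            if j is not None:   # W_ij = 1 iff voxels i and j are neighbors
                rows.append(i)
                cols.append(j)
    n = len(voxels)
    W = coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n)).tocsr()
    D = diags(np.asarray(W.sum(axis=1)).ravel())   # D_ii = sum_k W_ik
    L = (D - W).tocsc()
    # Shift-invert around zero to get the smallest eigenvalues of the sparse
    # symmetric Laplacian; the constant eigenvector (eigenvalue 0) is skipped.
    vals, vecs = eigsh(L, k=d + 1, sigma=-1e-6, which='LM')
    order = np.argsort(vals)
    return vecs[:, order[1:d + 1]]
```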

The segmented voxel clusters are then registered to their actual body chains using a probabilistic registration procedure, with some assumptions about the connection type of each body chain and the length/thickness relations between body chains. After registration, the skeletal and superquadric models of the body are estimated using a hierarchical procedure (starting with the torso and then the limbs) that takes into account general knowledge about human stature.

They experimented with several synthesized and real body voxel datasets. In their experiments, the LE-based voxel segmentation performed quite well, and it also succeeded in cases of self contact, which were not addressed by some previous voxel segmentation algorithms. Their experimental results on the HumanEvaII dataset, however, were not good (only around 9% of the total frames were successfully segmented and registered), which indicates the sensitivity of the LE-based segmentation step to voxel noise; a failure in this step affects all subsequent steps.

4. A GENERAL ARTICULATED BODY POSE ESTIMATION METHOD: COMBINING INITIALIZATION AND MODELING

As discussed above, a desired improvement to the KC-GMM method [3] is an automated initialization step. A possible solution is to use the hierarchical growing procedure for body model acquisition of [13]; in doing so, however, we would lose the generality of the KC-GMM method and could not apply it to the hand model. The voxel segmentation using the LE transformation in [23] has this generality, so it is a more appropriate choice for equipping the KC-GMM method with an automated initialization step. Regarding the LE-based method for body modeling [23], we know that the voxel segmentation step is sensitive to noise and that a failure in this initial step affects the whole process. This motivates the idea that, instead of performing voxel segmentation at every frame, we use it only for initialization; in subsequent frames, a tracking method that exploits temporal information can help overcome the sensitivity to noise to some extent. The concrete steps of the proposed combined method are shown in Figure 4: the LE transformation is applied to segment the body/hand voxels, at an initial specific pose that clearly reveals the body/hand structure, into different parts (e.g. limbs in the body model and fingers in the hand model). This segmentation result is then used to fill the gap of an automated body/hand model initialization for the KC-GMM method.

4.1. Implementation

Among the 4 steps of the proposed combined method shown in Figure 4, the voxelization (step 1) and the KC-GMM tracking (step 4) come from the previous work of Dr. Shinko Cheng at the Computer Vision and Robotics Research (CVRR) lab. My contributions are the implementation of the LE-based voxel segmentation [23] (step 2) and its application to the hand case, and the implementation of the initialization from segmented voxels (step 3), which fills the gap of an automated initialization for the KC-GMM method.

In applying the LE voxel segmentation procedure [23] to the hand case: because fingers also have length greater than their thickness, they form smooth 1-D curves in LE, and we can apply the spline fitting in LE to segment hand voxel data just as we do with body voxel data. In the hand case, however, the palm does not form a good 1-D curve in LE due to its nearly square shape, so we only use spline fitting to segment the voxels of the 5 fingers; the remaining voxels are considered palm voxels. The segmentation results of applying the spline fitting process in LE to both body voxels and hand voxels are shown in Figure 4.(b).

In the automatic initialization step, we have the segmented body/hand voxels and a template of the body/hand model, which contains a predefined body/hand structure (i.e. the number of components, the number of joints, and the number of DOF at each joint), and we want to adjust the necessary parameters, including component dimensions, joint positions, and joint angles, to achieve a model that fits the segmented voxels well. Because we require the body/hand to start at a specific pose (i.e. a stretched pose), this initialization step can be done with the following simple and fast procedures (Figure 5.(a)).

Hand model initialization:
• For finger registration, we first compute the center of each segmented voxel region (marked with a plus sign). Constructing lines from the palm center C_palm (which is already known) to all other centers, we see that the angular characteristic of the line from C_palm to C_thumb is distinctive and can be used to register which voxel region is the thumb (i.e. its minimum angle to all other lines is the largest). The other voxel regions can then be registered to the remaining fingers in sequence.
• The local z-axis of each finger is computed as the largest PCA component of the corresponding voxel region (marked with arrows). Because all fingers lie in the same plane in the stretched pose, we can use the found local z-axes to compute the local x-axis and y-axis of each finger. From these local axes, we can compute the orientation and the dimensions of each finger (by projecting the voxel region onto each local axis and finding the range). With the assumption that each finger consists of 3 equal segments, all joint positions can also be found (e.g. the red circles).
• For the palm, the local palm z-axis is determined by the line from C_palm to the lowest joint of the middle finger. The other palm parameters are then computed as described above.

Body model initialization:
• For body part registration, the center of each segmented voxel region is computed. The two lowest parts (in the z-direction) are registered as the two legs. The biggest remaining part is the torso. The remaining part closest to the torso center in the xy-plane is the head, and the other two parts are the two arms. The left/right assignment is done with the assumption that the arms tend toward the front direction.
• The local z-axes of the torso and each limb are computed as the largest PCA component of the corresponding voxel region (marked with arrows); a sketch of this shared computation is given after this list. The local z-axis of the head is set to be the same as the local z-axis of the torso. As in the hand case, we use the found local z-axes to compute the local x-axis and y-axis of each body part and then compute the orientation and the dimensions of each part. With the assumption that each limb consists of 2 equal segments, all joint positions can also be found.
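Both initialization procedures rely on the same PCA computation for a part's local z-axis and dimensions. A minimal sketch of that shared step, under our own naming and with the plane-based x/y refinement omitted, is:

```python
import numpy as np

def part_axes_and_dims(part_voxels):
    """Local axes and dimensions of one segmented part (hedged sketch of
    our initialization step).

    part_voxels: Nx3 coordinates of one segmented body part or finger.
    Returns (center, axes, extents): the columns of axes are the PCA
    directions with the local z-axis first; extents are the voxel ranges
    along each local axis.
    """
    center = part_voxels.mean(axis=0)
    centered = part_voxels - center
    # The largest PCA component of the voxel region gives the local z-axis.
    cov = np.cov(centered.T)
    vals, vecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    axes = vecs[:, ::-1]                # largest component first (local z)
    proj = centered @ axes              # voxels in local coordinates
    extents = proj.max(axis=0) - proj.min(axis=0)  # part dimensions
    return center, axes, extents
```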

The results of the body/hand model initialization are shown in Figure 5.(b)-(d). After initialization, the KC-GMM method [3] is used to continue inferring the body/hand pose in subsequent frames.

Figure 6. Integrated framework for human body pose estimation at multiple levels, elaborating a full model of body, hands, and head


    5. AN INTEGRATED FRAMEWORK FOR HUMAN BODY POSE ESTIMATION AT MULTILEVEL

Although the ability to achieve a full model of the human body (including body, hands, and head) is desirable, to the best of our knowledge there is currently no published work that addresses this issue. There are some possible reasons for this. First, the high-dimensionality problem mentioned above becomes much more serious when estimating the full body model; e.g. for a full model of the body with 2 hands, we have 19 DOF (body) + 2*27 DOF (two hands) = 73 DOF. We may call this an explosion of the pose configuration space, so applying current methods to solve this huge problem in one shot would be very inefficient or


even impossible. Second, the difference in size between body and hand makes it difficult to acquire good data (e.g. voxel data) and to estimate the pose of both body and hand in one shot. In our proposed framework (Figure 6), we use different camera arrays for body, hand, and head in order to obtain good data for each task. The huge problem of estimating the full model of the human body is still broken into the separate tasks of estimating body pose, hand pose, and head pose; these tasks are done separately, so the search-space explosion does not arise. The novel point is that we calibrate these different camera arrays to the same extrinsic world coordinates and capture the input data for body, hand, and head simultaneously, so that, in the final step, a module can combine the results of each task into a full model of the human body.

Besides the common issues in each task, such as high dimensionality (even within each task the number of DOF is still high), occlusion, rapid motion (e.g. hand motion can be very fast), and real-time requirements, this framework raises some other issues that need to be solved. First, the calibration between different camera arrays is more difficult. Second, because most methods for articulated pose estimation from voxel data only work when the given voxel data belong to the concerned object alone, we need to segment the voxel data for each concerned body part (e.g. segment hand voxel data from the voxel data of the lower arm). In our experiments, we dealt with these issues to some extent; however, they remain unsolved in general and require more research. Following this framework, we have so far combined body and hand into a full model; the combination of head pose will be part of our follow-up work.

6. EXPERIMENTAL RESULTS

6.1. Experimental setup

We apply the combined method described above to both body and hand modeling and tracking. The experiment is also organized according to our integrated framework for body and hand. The experimental setup is shown in Figure 7: there are 2 camera arrays, the first consisting of 4 color cameras on the ceiling to capture body data, and the second consisting of 3 thermal cameras on the front wall to capture hand data. The two camera arrays are calibrated to the same extrinsic world coordinates, and the data for body and hand are captured simultaneously. As mentioned in Section 5, there are some issues that we have to face when applying this integrated framework. First, we currently use the Caltech Matlab calibration toolbox, so in order to calibrate the two camera arrays to the same world coordinates, we have to configure them so that they can see a common sub-region of the floor. This camera configuration, however, leads to some limitations, i.e. when capturing hand data, the hand needs to stay within a limited range of positions in order to reduce the ambiguity in the voxel reconstruction. Regarding the issue of segmenting hand voxel data from lower-arm voxel data: since we use thermal cameras for the hand, background subtraction for the hand region can be done easily by setting upper and lower thresholds corresponding to the temperature range of skin.
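The thermal banding step amounts to a one-line mask; a sketch, where lower and upper are intensity thresholds calibrated (by our own assumption) to the skin temperature range of the sensor:

```python
import numpy as np

def hand_mask(thermal_image, lower, upper):
    """Segment the hand in a thermal image by banding on skin temperature.
    lower/upper: intensity thresholds bracketing the skin range."""
    return (thermal_image >= lower) & (thermal_image <= upper)
```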

The experimental scenario is that a person comes into the room and moves around; he then goes to the front screen and puts his hand into the region where the thermal cameras can capture data. Body modeling & tracking and hand modeling & tracking are then done separately, but their results can be combined into a full model of both body and hand.

6.2. Experimental results for body/hand modeling and tracking

The template models for body and hand (defining the number of components, the number of joints, and the number of DOF) are the same as described in [3]. For the body, we experiment with real body voxel data acquired from the 4 color cameras on the ceiling. For the hand, we experiment with both synthesized and real hand voxel data. The synthesized data are constructed using the same hand model shown in Figure 3.(b) and simulate a waving motion pattern. Real hand voxels are acquired from the 3 calibrated thermal cameras at the front screen; background subtraction on the thermal images is simple, with an upper and a lower threshold corresponding to the temperature range of skin (see the sketch above), and the real hand voxels are then reconstructed using the shape-from-silhouette technique.

The visual results of automated body/hand model initialization and tracking are quite good, as shown in Figures 8 and 10. With synthesized data, we also have quantitative results for the angular and position errors of the hand components (Figure 9). These plots are periodic, with peaks at times when the hand is in a nearly closed fist pose; however, the error decreases when the hand opens, and we do not lose track. For comparison, Figure 12.(a) shows the hand voxel segmentation result for the nearly closed fist pose mentioned above. The voxel segmentation fails, which means that a hand pose estimation method relying on voxel segmentation at every frame would also fail in this case. Figure 12.(b) shows an incorrect pose estimated by KC-GMM when the hand model is manually initialized with the correct scale (dimensions) but an incorrect orientation. This failure is conceivable because the nature of the EM algorithm makes it easy to get stuck in a sub-optimal solution; we meet the same issue when there is a large displacement between frames. These results imply that our automated model initialization works well and is definitely a better replacement for the previous manual initialization step.


Figure 7. Experimental setup: left, camera positions; right, a scene of the experimental room

Figure 8. Visual results of hand modeling and tracking using the combined method. With synthesized data, we can also produce quantitative results such as joint position errors and joint angle errors

Figure 9. Quantitative results on synthesized hand data: left, angular error; right, position error. At each frame, errors are computed for each hand component; the min, max, and mean errors are taken over the components


Figure 10. Visual results of body modeling and tracking using the combined method

Figure 11. Visual results of combining the achieved body model and hand model into a full model

Figure 12. Results for comparison with the successful tracking results in Figure 8: (a) voxel segmentation [23] fails in the nearly closed fist pose (the fingers are not well separated), and (b) the KC-GMM method [3] fails without careful manual initialization (initialized with the correct scale but an incorrect orientation).

6.3. Combining the body and hand results into a full model

Because we run the experiments according to our integrated framework, the procedure to combine the resulting body model and hand model into a full model is straightforward. In our experiments, this combination works quite well, although there is sometimes a small mismatch between the position of the hand model and the position of the lower arm in the body model. This can result from errors in the body and hand model initialization and tracking. In our experiments, we made a simple adjustment by constraining the wrist joint on the lower arm and the wrist joint on the hand to be the same. The combined full model of body and hand is shown in Figure 11. The first three images are from when the person moves around the room. The last two images are from when the person goes to the front screen so that hand data can be captured; the resulting body model and hand model are then combined into a full model.

7. CONCLUDING REMARKS AND FUTURE WORK

In this paper, we focused on the subarea of model-based methods for real 3D human body pose estimation using volumetric data. After a review and analysis of several previous methods, we proposed a method that combines the KC-GMM method for articulated body pose inference [3] with the spline-fitting method in LE for articulated body voxel segmentation [23] to obtain an automated human body model initialization and tracking system. Our experiments show that this combination provides better results than using either previous method separately. We also proposed a framework for performing human body pose estimation at multiple levels (i.e. body level, hand level, head level) in an integrated way, so that the pose estimation results from each level can be combined into a full model. This multilevel pose estimation is an interesting topic, and we have just started working on it. There is a lot of room for improvement in our current implementation. We want a better calibration procedure to make the configuration of the camera arrays more flexible, which will help reduce ambiguity in voxel reconstruction and therefore provide better voxel data. Currently, the process of body/hand modeling, tracking, and combination into a full model is done offline, so an online implementation is desirable. The mentioned issues of calibration between different camera arrays and voxel


segmentation of each concerned body part remain unsolved in general and require more research. Another obvious piece of follow-up work is to implement and integrate the head pose estimation module.

Other possible avenues for future work in the area include pose estimation and tracking of multiple subjects simultaneously, as well as improving the performance and robustness of current pose estimation methods. Regarding the latter, we see several possible directions. First, we can continue combining good characteristics from different methods to obtain a more robust one; for example, we may want to incorporate prediction information, as done in [1, 13], into the proposed combined method. Second, we can find ways to use both 3D voxel features and 2D image features: in [1, 5], color information is associated with the voxel data, and other 2D features like edges and appearance models should also be useful. Regarding the major difficulty of the high-dimensional body pose configuration space, we can also exploit the divide-and-conquer principle by breaking the problem into smaller-dimensional ones, as in the hierarchical estimation of body pose in [13] (detect the head first, then the torso, and so on) or the breaking of complex human movements into basic motions in [1].

8. ACKNOWLEDGEMENT

The above experiment was developed from the previous work of Dr. Shinko Cheng during his time at the Computer Vision and Robotics Research (CVRR) lab. I would like to thank my advisor, Prof. Mohan Trivedi, and my colleagues at CVRR, especially Dr. Shinko Cheng, for their invaluable advice and assistance.

REFERENCES

[1] F. Caillette, A. Galata, T. Howard, "Real-time 3-D human body tracking using learnt models of behaviour", Computer Vision and Image Understanding (109), 2008.
[2] S. Cheng, M. Trivedi, "Multimodal Voxelization and Kinematically Constrained Gaussian Mixture Model for Full Hand Pose Estimation: An Integrated Systems Approach", IEEE Int. Conference on Computer Vision Systems, pages 34-42, 2006.
[3] S. Cheng, M. Trivedi, "Articulated Human Body Pose Inference from Voxel Data Using a Kinematically Constrained Gaussian Mixture Model", CVPR EHuM2: 2nd Workshop on Evaluation of Articulated Human Motion and Pose Estimation, 2007.
[4] G. Cheung, T. Kanade, "A real-time system for robust 3D voxel reconstruction of human motions", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 714-720, 2000.
[5] G. Cheung, S. Baker, T. Kanade, "Shape-from-silhouette of articulated objects and its use for human body kinematics estimation and motion capture", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2003.
[6] Q. Delamarre, O. Faugeras, "3D articulated models and multiview tracking with physical forces", Computer Vision and Image Understanding, 81(3):328-357, 2001.
[7] A. Doshi, M. Trivedi, "Hybrid Cone-Cylinder Codebook Model for Foreground Detection with Shadow and Highlight Suppression", IEEE International Conference on Advanced Video and Signal based Surveillance, Nov 2006.
[8] A. Erol, G. Bebis, M. Nicolescu, R. D. Boyle, X. Twombly, "Vision-based hand pose estimation: A review", Computer Vision and Image Understanding, 108(1-2):52-73, October-November 2007.
[9] T. Horprasert, D. Harwood, L. S. Davis, "A statistical approach for real-time robust background subtraction and shadow detection", IEEE ICCV Frame-Rate Workshop, September 1999.
[10] E. Hunter, "Visual Estimation of Articulated Motion using the Expectation-Constrained Maximization Algorithm", PhD thesis, University of California, San Diego, 1999.
[11] D. Knossow, R. Ronfard, R. Horaud, "Human Motion Tracking with a Kinematic Parameterization of Extremal Contours", International Journal of Computer Vision, vol. 79, pages 247-269, 2008.
[12] M.W. Lee, R. Nevatia, "Human Pose Tracking in Monocular Sequence Using Multi-level Structured Models", IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2008.
[13] I. Mikic, M. Trivedi, E. Hunter, P. Cosman, "Human Body Model Acquisition and Tracking using Voxel Data", International Journal of Computer Vision (IJCV), pages 199-223, July 2003.
[14] T. Moeslund, E. Granum, "A survey of computer vision-based human motion capture", Computer Vision and Image Understanding, 81(3):231-268, 2001.
[15] T. Moeslund, A. Hilton, V. Kruger, "A survey of advances in vision-based human motion capture and analysis", Computer Vision and Image Understanding, pages 90-126, 2006.
[16] E. Murphy-Chutorian, M. Trivedi, "Head Pose Estimation in Computer Vision: A Survey", IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), to appear, 2008.
[17] E. Murphy-Chutorian, M. Trivedi, "HyHOPE: Hybrid Head Orientation and Position Estimation for Vision-based Driver Head Tracking", IEEE Intelligent Vehicles Symposium (IV 2008), Eindhoven, The Netherlands, June 2008.
[18] K. Ogawara, K. Hashimoto, J. Takamatsu, K. Ikeuchi, "Grasp recognition using a 3D articulated model and infrared images", IEEE/RSJ Conference on Intelligent Robots and Systems, volume 2, pages 27-31, 2003.
[19] R. Poppe, "Evaluating Example-based Pose Estimation: Experiments on the HumanEva Sets", CVPR EHuM2: 2nd Workshop on Evaluation of Articulated Human Motion and Pose Estimation, 2007.
[20] R. Poppe, "Vision-based Human Motion Analysis: An Overview", Computer Vision and Image Understanding, vol. 108, pages 4-18, 2007.
[21] D. Ramanan, D.A. Forsyth, A. Zisserman, "Tracking People by Learning Their Appearance", IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2007.
[22] G. Slabaugh, B. Culbertson, T. Malzbender, "A survey of methods for volumetric scene reconstruction from photographs", International Workshop on Volume Graphics, pages 81-100, 2001.
[23] A. Sundaresan, R. Chellappa, "Model driven segmentation of articulating humans in Laplacian Eigenspace", IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2007.
[24] E. Ueda, Y. Matsumoto, M. Imai, T. Ogasawara, "Hand-pose estimation for vision-based human interfaces", IEEE Transactions on Industrial Electronics, 2003.
[25] N. Werghi, "Segmentation and Modeling of Full Human Body Shape From 3-D Scan Data: A Survey", IEEE Transactions on Systems, Man, and Cybernetics, Part C, 37(6):1122-1136, 2007.
