Capture of Arm-Muscle Deformations using a Depth-Camera
Nadia Robertini, University of Saarland
Saarbruecken, [email protected]
Dresden, [email protected]
Dr. Kiran Varanasi
Max-Planck-Institut Informatik, Saarbruecken, Germany
Prof. Dr. Christian Theobalt
Max-Planck-Institut Informatik, Saarbruecken, Germany
ABSTRACT
Modeling realistic skin deformations caused by underlying muscle bulging has a wide range of applications in medicine, entertainment and art. Current acquisition systems based on dense markers and multiple synchronized cameras are able to record and reproduce fine-scale skin deformations with sufficient quality. However, the complexity and high cost of these systems severely limit their applicability. In this paper, we propose a method for reconstructing fine-scale arm muscle deformations using the Kinect depth camera. The captured data from the depth camera has no temporal contiguity and suffers from noise and sensor artifacts, and is thus unsuitable by itself for potential applications in visual media production or biomechanics. We process the noisy depth input to obtain spatio-temporally consistent 3D mesh reconstructions showing fine-scale muscle bulges over time. Our main contribution is the incorporation of statistical deformation priors into the spatio-temporal mesh registration process. We obtain these priors from a previous dataset of a limited number of physiologically different actors captured using a high-fidelity acquisition setup. These priors help provide a better initialization for the final non-rigid surface refinement, which models deformations beyond the range of the previous dataset. Thus, our method is an easily scalable framework for bootstrapping the statistical muscle deformation model, by extending the set of subjects through a Kinect-based acquisition process. We validate our spatio-temporal surface registration method on several arm movements performed by people of different body shapes.
Currently at Technicolor Research & Innovation, Rennes, France
1. INTRODUCTION
Reconstructing high-quality muscle deformations in a non-intrusive manner is a key problem in the areas of entertainment, human biomechanics and human-centered design. Depth cameras like the Microsoft Kinect, which recently appeared on the consumer market, provide a relatively cheap and easy mechanism to capture 3D images. However, the captured depth images have significant artifacts due to sensor noise, occlusions and the lack of temporal contiguity in capture. As such, they are unusable for researchers in biomechanics or human-computer interfaces who want to build accurate user-specific models of muscle deformations. In this paper, we propose a method for reconstructing high-quality and temporally aligned 3D meshes of the human shoulder-arm region from depth images captured by the Kinect camera. Our method thus bridges an important gap and enlarges the research scope for many areas concerned with modeling muscle deformations, allowing them to capitalize on cheap consumer hardware. For example, realistic virtual humans and their muscle movements can be modeled for visual media production in a cost-effective manner, through the use of cheap acquisition systems and fewer hours of manual work by artists. Sports scientists and medical practitioners can observe the physiological action of muscles on a day-to-day basis and provide personalized advice to athletes and patients without the use of expensive and intrusive sensors.
Figure 1: Overview of our capture pipeline: depth measurements are first filtered and then used for an initial surface reconstruction. A statistical deformation model based on two different datasets is built and then used to clean the initial reconstruction within the space of learned deformations. A last refinement step is used to capture fine-scale details not captured by previous shapes in the database.
At present, modeling realistic muscle deformations of virtual humans remains a highly labour-intensive task. Commercial systems use specialized kinematic rigs for virtual characters, which have hundreds of control parameters to derive localized bulging effects on a fine scale by approximating them with a set of bones. As an alternative, biomechanically based simulation of human anatomy and physics-based muscle deformation can be performed. However, this remains computationally very expensive, and such rigs are hard to control and adapt to new characters. Thus, data-driven simulation methods have been developed in order to overcome some of these limitations. Based on a training set of artist-given deformation examples or 3D scans acquired directly from the real world, an artistic interface can be developed that is simple to use, but which reproduces complex muscle deformation behavior as visible in the training set. These data-driven simulation methods bridge an important gap in the artistic production pipeline. However, acquiring the training set of spatio-temporally aligned muscle deformation examples remains a challenging problem. In order to acquire fine-scale deformation, a large number of markers have to be placed on the human body and tracked using expensive imaging systems. The complexity of this acquisition process is a high barrier that prevents novice artists and practitioners from taking advantage of the research advances in data-driven muscle simulation. Furthermore, people exhibit an enormous statistical variation in muscle deformations with respect to body pose. The data-driven simulation methods are, by their very design, restricted in their modeling ability to the limited set of human subjects captured in the training data. Unless the acquisition process becomes cheap and simple to use, it is difficult to capture a substantially large set of people and model the statistical variation in their muscle deformations. In this paper, we make a contribution in this regard by proposing a novel acquisition method based on the Kinect depth camera.
The Kinect depth sensor has been deployed with great success for various tasks by researchers in robotics, computer vision and human-computer interaction. However, most of these tasks have been restricted to reconstructing a static world or recovering the motion dynamics only at a coarse scale. Modeling fine-scale non-rigid surface deformation using a consumer-grade depth sensor like the Kinect remains an immense challenge. The noise artifacts that occur in the depth image make the simultaneous recovery of accurate 3D geometry and motion severely under-constrained. The artifacts in the depth image can arise from limitations in the imaging process, the limited resolution of the capture sensor, and surface occlusions that naturally occur during motion. In this paper, we propose a method to process these noisy depth images and reconstruct high-quality spatio-temporally aligned 3D meshes. Our main observation is that a previously acquired dataset of muscle deformations (from a set of 10 subjects captured with a multi-camera acquisition system, kindly made available by Neumann et al.) provides useful priors for initializing the 3D registration, and ultimately for reconstructing fine-scale 3D surface deformations beyond the previous dataset. We make the following key contributions to push the research agenda in this field.
1. We provide a method for filtering three-dimensional correspondence estimates in the noisy depth map input, by using geometric priors of deformation and statistical priors learned from a capture dataset.
2. We provide a framework for extending the generalization scope of a statistical model of deformation, by capturing more people than are present in the initial dataset.
2. RELATED WORK
Modeling fine-scale muscle deformations has long been an active research topic in computer graphics. We refer the reader to the related work section in  for an elaborate review of muscle deformation models: skinning approaches, physiologically-based simulation models and data-driven simulation models. In the following, we review only certain important related works in data-driven modeling of muscle deformations.
In 2006, Park and Hodgins developed and demonstrated an acquisition system for capturing fine-scale muscle deformations during high-speed motions of several actors. They used a very large set of reflective markers (350, against the 40-60 previously used) placed on muscular and fleshy parts of the body. They first captured the rigid body motion of the markers and then used the residual deformations to deform a hand-designed subject-specific model. However, the marker application time and acquisition complexity were extremely high. This inhibits the acquisition of a larger number of subjects, which could contribute more muscle and skin motion data. Another limitation of their system is that the captured dynamic motion cannot be generalized to different body types. In their later work in 2008, they presented a data-driven technique for synthesizing skin deformation from skeletal motion. Using the same input data as in the previous work, they built a database of deformation data separately parametrized by pose and acceleration. They then learned pose-specific and acceleration-specific deformations using Principal Component Analysis (PCA) and built a statistical model. Because of the complexity of the acquisition step, they filled the database with a huge number of poses from a single subject, causing the statistical model again to be highly shape dependent, although they introduced the possibility of generating novel motions for subjects with body shapes similar to the one contained in their database. Using a similar acquisition system and pipeline, a later work by Hong et al. in 2010 showed an improved skeleton configuration that, combined with standard skinning algorithms, generates a more visually pleasing and physically accurate skin deformation. Focusing on the shoulder complex, consisting of the shoulder, elbow and wrist joints, they concluded that inserting one additional segment between the chest and the upper arm greatly improves the motion simulation of the shoulder. The main drawbacks stem from the generality of their learning algorithm, which is highly dependent on the completeness of the captured poses. Furthermore, their model is subject-specific and suffers from the same limitations as the previously discussed work.
A recent work by Neumann et al. addresses some of the limitations of previous work and builds a more generalized statistical model across multiple people with different body shapes. Like Hong et al., they focused on the shoulder complex, first capturing shape variations using a novel acquisition and reconstruction approach, and then modeling the deformations as a function of body pose, shape and external forces. Even though it uses only a small number of parameters, their model is capable of reproducing fine-scale muscle deformations in novel poses and shapes and, for the first time, under the action of several external forces on the arm. Because of its efficiency, the model can be used interactively by artists to reproduce appealing complex skin deformation effects. However, the complexity of the acquisition system limited their acquisition to just 10 subjects. In this work, we use the dataset kindly made available to us by Neumann et al. to derive statistical priors for registering the template mesh of the human arm to a noisy depth image captured by the Kinect camera. Thus, we propose an easily scalable framework for extending the statistical model by capturing more subjects in a much easier acquisition setup. Please note that, unlike Neumann et al., we do not consider the problem of modeling the effect of external forces on surface deformation, owing to the capture limitations of the depth sensor. We limit ourselves to modeling arm muscle deformations due to body pose and body shape variations among people.
3. OVERVIEW
Our method aims to reconstruct high-quality spatio-temporally aligned 3D meshes of the human arm from noisy depth images. We expect the human subjects to stand close to the Kinect camera and perform arm movements through the shoulder and elbow joints in a slow and natural manner. As argued by  (please also refer to ) these movements can be interpreted as quasi-static with respect to their underlying biomechanics, without the need to consider dynamics, such as jiggling of the flesh or skin in rapid motions.
As input to our method, we take depth image frames, which are disconnected depth measurements affected by noise and quantization artifacts and which, together with the low resolution, poorly represent the originally captured shape. Further, they lack measurements on the back side of the arm, which is invisible from the sensor's point of view. From this input, we produce a sequence of temporally consistent 3D mesh reconstructions that show the arm motion in fine-scale detail. An overview of our acquisition and reconstruction pipeline is shown in Figure 1.
We start by cleaning the depth data to improve its quality (section 4). After that, we fit a template arm mesh to the depth points by registering the template against the measurements (section 5). This step allows us to overcome limitations such as the unknown overall shape, including the interpolation of measurements in unseen areas like the back side. Furthermore, the use of the template allows us to describe the motion by relating depth measurements taken at different time steps. For the registration phase, we use an improved version of the classical Non-Rigid Iterative Closest Point (ICP) algorithm. Specifically, we filter the correspondences through a statistical deformation model (section 6), which predicts local muscle bulging with respect to changes of pose and shape as a prior for tracking. For each subject, we obtain from the model an initial template mesh that represents the subject's specific shape and restricts the motion to the pose deformations allowed by the model. This way we prevent unwanted distortions caused by the general Non-Rigid ICP approach and obtain a plausible arm that roughly aligns to the measurements through time. At this point, when the arm mesh and the measurements are close enough to each other, we perform a final refinement step (section 7), deforming the mesh towards the measurements, which are already filtered and temporally coherent, without any restriction imposed by the model. This way, we can capture new fine-scale detail even when it lies outside of the space represented by the current statistical arm model, and we can feed that additional detail back into the model to extend it.
4. DEPTH-MAP ACQUISITION AND PRE-PROCESSING
Figure 2: The captured depth maps (a) are first segmented, which removes unwanted background objects (b). The small box on the top-right side shows the corresponding RGB image from the image sensor (which we do not use for reconstruction). Distance is color coded using the scale on the right. The dark blue pixels are missing measurements.
Experiments have shown that the Kinect's measurements suffer mainly from two limitations: random noise, which increases with distance from the sensor, and missing data. There are multiple reasons that cause the sensor to miss depth measurements in certain scene areas. Some of them are connected with the sensor range and others with the so-called shadows [1, 2, 7], which are due to the disparity between the camera and the projector of the Kinect. In order to improve the initial depth-map quality, we first perform segmentation. Our goal here is to retain only the measurements that lie on the subject's arm, which is our sole interest (see Figure 2). Our segmentation algorithm uses simple thresholding based on the distance from the sensor.
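The distance-based thresholding can be illustrated with a minimal Python sketch. It assumes the depth map is a nested list of distances in millimetres and that missing measurements are encoded as 0. The function name `segment_by_depth` and the bounds `near_mm` and `far_mm` are hypothetical and chosen only for illustration; this paper does not state specific threshold values.

```python
MISSING = 0  # assumption: a missing depth measurement is stored as 0


def segment_by_depth(depth_map, near_mm, far_mm):
    """Keep pixels whose distance lies in [near_mm, far_mm]; mark the rest as missing.

    depth_map is a list of rows, each row a list of distances in millimetres.
    near_mm and far_mm are hypothetical bounds that enclose the subject's arm.
    """
    return [
        [d if d != MISSING and near_mm <= d <= far_mm else MISSING for d in row]
        for row in depth_map
    ]


# A tiny 2x3 depth map: the 2500 mm pixel belongs to the background.
depth = [[0, 600, 900],
         [750, 2500, 800]]
arm_only = segment_by_depth(depth, near_mm=500, far_mm=1000)
```

Because the subject is expected to stand close to the sensor, a single pair of bounds is a plausible way to separate the arm from the background in such a setup.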
Figure 3: Depth map smoothing: (a) input depth map, (b) depth map after smoothing. Notice how in (a) structural/quantization artifacts are visible, whereas the result in (b) shows a smoother surface that is better suited for surface reconstruction.
Next, we improve the quality of the segmented depth data by running well-studied filtering algorithms: a median and a Gaussian filter. The median filter removes outliers arising from random noise, and the Gaussian smoothing reduces structural noise caused by the quantization of the data (see Figure 3). We get rid of flying pixels around the borders by morphological thinning.
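As a rough illustration of the two filters, the following Python sketch applies a 3x3 median filter followed by a 3x3 Gaussian-like smoothing to a small depth map. The window size, the binomial weights, the encoding of missing pixels as 0.0 and all function names are assumptions made for this example; the morphological thinning step is not shown.

```python
from statistics import median

MISSING = 0.0  # assumption: a missing depth measurement is stored as 0.0
BINOMIAL = {-1: 1.0, 0: 2.0, 1: 1.0}  # 1D weights of a small Gaussian-like kernel


def _valid_neighbours(depth, r, c):
    """Yield (dr, dc, value) for every non-missing pixel in the 3x3 window at (r, c)."""
    rows, cols = len(depth), len(depth[0])
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols and depth[rr][cc] != MISSING:
                yield dr, dc, depth[rr][cc]


def median_filter(depth):
    """Replace each valid pixel by the median of the valid pixels around it."""
    out = [row[:] for row in depth]
    for r, row in enumerate(depth):
        for c, d in enumerate(row):
            if d != MISSING:
                out[r][c] = median(v for _, _, v in _valid_neighbours(depth, r, c))
    return out


def gaussian_filter(depth):
    """Replace each valid pixel by a weighted mean of the valid pixels around it."""
    out = [row[:] for row in depth]
    for r, row in enumerate(depth):
        for c, d in enumerate(row):
            if d != MISSING:
                total = weight = 0.0
                for dr, dc, v in _valid_neighbours(depth, r, c):
                    w = BINOMIAL[dr] * BINOMIAL[dc]
                    total += w * v
                    weight += w
                out[r][c] = total / weight
    return out


def denoise(depth):
    """Median filtering first (outliers), then Gaussian smoothing (quantization steps)."""
    return gaussian_filter(median_filter(depth))
```

Missing pixels are skipped both as filter inputs and as outputs, so holes in the depth map are neither filled nor allowed to pull neighbouring values towards zero.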
5. SURFACE REGISTRATION
In order to reconstruct the original surface from the disconnected measurements obtained from the Kinect, we use a surface-fitting-based approach. Such an approach guarantees consistent surfaces with fixed topology, which are optimal for tracking fine-scale skin deformations over time. Our algorithm is similar to the one proposed by Stoll et al. In particular, after an initial step that rigidly aligns the template mesh to the sampled points, we start a non-rigid ICP method that iteratively deforms the template mesh towards the measurements. The deformation process is guided by local point correspondences.
5.1 Rigid Alignment
In this step, we fix the space within which the human subject moves the arm relative to the location of the template mesh, such that the imaged point cloud is close to the template mesh. This step is particularly important for the registration process, in which we register the template against the measurements. An accurate initial surface approximation of the sample points obtained from the Kinect (in terms of proximity) considerably increases the probability of generating high-quality reconstructions. By constraining the subject's position and orientation with respect to the sensor (e.g. facing the sensor in a roughly fronto-parallel orientation) to be fixed throughout the entire arm motion sequence, it is possible to find a global rigid transformation (scaling + rotation + translation) relative to the template mesh, which we apply to all the point clouds in the sequence. We ask the subject to orient the arm and shoulder joints in the first frame such that the imaged point cloud is already near the template mesh, which can then be tracked over the sequence.
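The following Python sketch shows how one global transformation of the form p' = s R p + t (strictly speaking a similarity transform, since it includes a scale s) could be applied to every point cloud of a sequence. It does not show how the transformation is estimated; the function names and the example values for scale, rotation and translation are hypothetical.

```python
import math


def rotation_about_z(angle_rad):
    """Return a 3x3 rotation matrix for a rotation about the z axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]


def transform_cloud(points, scale, rotation, translation):
    """Apply p' = scale * R p + t to each 3D point of one point cloud."""
    out = []
    for p in points:
        rotated = [sum(rotation[i][j] * p[j] for j in range(3)) for i in range(3)]
        out.append(tuple(scale * rotated[i] + translation[i] for i in range(3)))
    return out


def align_sequence(sequence, scale, rotation, translation):
    """Apply the same global transformation to all point clouds in the sequence."""
    return [transform_cloud(cloud, scale, rotation, translation) for cloud in sequence]


# Two single-point frames, transformed with illustrative (made-up) parameters.
frames = [[(1.0, 0.0, 0.0)], [(0.0, 1.0, 0.0)]]
aligned = align_sequence(frames, 2.0, rotation_about_z(math.pi / 2), (0.0, 0.0, 1.0))
```

Reusing one transformation for the whole sequence is only valid under the constraint stated above, namely that the subject's position and orientation relative to the sensor stay fixed during the motion.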
5.2 Finding and Filtering Correspondences
Figure 4: Finding and filtering correspondences example. (Gray) The template mesh, (Red) Point cloud measurements, (Green) Nearest correspondences, (Black) Filtered correspondences.
The problem of finding correspondences between a source and a target set has been intensively studied, particularly in surface matching and registration. If correct correspondences are known, it is possible to find the transformation that aligns the sets. Since the surface and the measurements are already quite close after the previous step, we can proceed by computing closest-point correspondences. For each mesh vertex, we find the nearest point sample and set it as a candidate correspondence. The next step is filtering the correspondences. We propose the following filtering strategies to suit our specific problem setting.
Arm vertices only: Our mesh model is composed of the arm and part of the chest, encompassing the biceps, triceps, deltoid and pectoralis muscles. A hand is attached to the arm for aesthetic visualization but is not explicitly considered for tracking or deformation modeling. As we would like to focus on deforming the arm shape, we restrict the considered pairs to the ones placed on the arm.
Front side only: We remove all the correspondence pairs coming from the back side of the arm model. From the sensor's point of view, only half of the arm is visible (the front side), and valid pairs should lie on the same side of the model. To this end, we compare the directions of the vertex normals n_v with the known sensor view direction s, and reject all inconsistent pairs that do not satisfy the condition: