Henry M. Clever, Ariel Kapusta, Daehyung Park, Zackory ... · Henry M. Clever, Ariel Kapusta,...

3D Human Pose Estimation on a Configurable Bed from a Pressure Image

Henry M Clever Ariel Kapusta Daehyung Park Zackory Erickson Yash Chitalia Charles C Kemp

Abstractmdash Robots have the potential to assist people in bedsuch as in healthcare settings yet bedding materials like sheetsand blankets can make observation of the human body difficultfor robots A pressure-sensing mat on a bed can provide pres-sure images that are relatively insensitive to bedding materialsHowever prior work on estimating human pose from pressureimages has been restricted to 2D pose estimates and flat bedsIn this work we present two convolutional neural networks toestimate the 3D joint positions of a person in a configurablebed from a single pressure image The first network directlyoutputs 3D joint positions while the second outputs a kinematicmodel that includes estimated joint angles and limb lengths Weevaluated our networks on data from 17 human participantswith two bed configurations supine and seated Our networksachieved a mean joint position error of 77 mm when testedwith data from people outside the training set outperformingseveral baselines We also present a simple mechanical modelthat provides insight into ambiguity associated with limbs raisedoff of the pressure mat and demonstrate that Monte Carlodropout can be used to estimate pose confidence in thesesituations Finally we provide a demonstration in which amobile manipulator uses our networkrsquos estimated kinematicmodel to reach a location on a personrsquos body in spite of theperson being seated in a bed and covered by a blanket

I INTRODUCTION

Various circumstances such as illness injury or longtermdisabilities can result in people receiving assistance in bedPrevious work has shown how robots can provide assistancewith ADLs [1]ndash[3] but providing assistance to a person inbed can be challenging Estimating the pose of a personrsquosbody could enable robots to provide better assistance Typicalmethods of body pose estimation use line-of-sight sensorssuch as RGB cameras which can have difficulties whenthe body is occluded by blankets loose clothing medicalequipment over-bed trays and other common items inhealthcare settings such as hospitals A pressure-sensingmat on the bed can allow for estimation of the bodyrsquos posein a manner that is less sensitive to bedding materials andsurrounding objects [2] [4]ndash[6] However prior work withpressure images has not addressed a number of concernskey to the success of robot assistance in bed namely (1)pose estimation in 3D for either flat or non-flat beds and (2)appropriately dealing with uncertainty when the pose estimatemay be inaccurate

In this work we present a method for estimating the 3Djoint positions in real time with a measure of confidence ineach estimated position for a person in a configurable bedusing a pressure-sensing mat We provide evidence that ourmethod works for some challenging scenarios such as when

H M Clever A Kapusta D Park Z Erickson Y Chitalia and C Kempare with the Healthcare Robotics Lab Institute for Robotics and IntelligentMachines Georgia Institute of Technology

H M Clever is the corresponding authorhenryclevergatechedu

Fig 1 We demonstrate how an assistive robot could use our 3D human poseestimation method A PR2 robot uses our methodrsquos body pose estimation toreach to a personrsquos shoulder

the bed and human are configured in a seated posture andwhen limbs are raised off of the pressure sensing mat Furtherwe release a motion capture labeled dataset of over 28000pressure images across 17 human participants in addition toour open-source code

In prior work we used a pressure sensing mat on aconfigurable bed to estimate the location of a personrsquos headfor positioning a rigid geometric model of a human body [2]In this paper we propose and compare two convolutionalneural network (ConvNet) architectures to learn a mappingfrom a pressure image and bed configuration to 3D humanjoint positions the head neck shoulders elbows wristschest hips knees and ankles The first method directlyregresses to the 3D ground truth labels (x y z) The secondmethod embeds a 17-joint kinematics skeleton model of thehuman to the last layer of the ConvNet to enforce constraintson bone lengths and joint angles This provides a morecomplete pose representation with additional unlabeled jointsand latent space joint angle estimates We introduce a newarchitecture that adjusts the skeleton model for differentlysized people while providing comparison to a constant linklength skeleton model We train the kinematics ConvNetsend-to-end backpropagating from 3D joint Euclidean errorthrough the kinematics model We compare our ConvNetmethods to baseline data-driven algorithms including ridgeregression and k-nearest neighbors

Our configurable bed introduced previously as Autobed [2][7] features adjustable height leg rest angle and head restangle Autobed can sense its own state sense the pressuredistribution of the person on the bed and communicate withother devices In this work we estimate the human jointpositions in two Autobed configurations by adjusting thebedrsquos head rest supine (0 flat) and seated (60 incline)

Lack of contact by limbs or other body parts presents a

arX

iv1

804

0787

3v2

[cs

RO

] 2

9 A

ug 2

018

challenging issue for the pressure image modality In thiswork we consider common poses where this issue arisessuch as an arm raised in the air resembling a double invertedpendulum Among other poses this case demonstrates wherethe pressure image can be similar for different configurationsof the arm In such a case the pressure data may beinsufficient to confidently estimate the pose of the armAn estimate of confidence or model uncertainty can bevaluable for example to allow an assistive robot to reject low-confidence estimates by removing them from a list of potentialgoals in task plan execution To estimate model uncertaintywe use Monte Carlo dropout a method proposed by Gal andGhahmarani [8] With this method we perform a numberof stochastic forward passes through the ConvNet duringtest time and compute the joint position and joint positionconfidence from the moments of the output distribution

II RELATED WORK

Markerless human pose estimation is a challenging problemcomplicated by environment factors the human pose config-urations of interest and data type Relatively few researchershave used pressure images for human pose estimation in bed[4]ndash[6] while many used cameras in myriad environments andposes [9]ndash[21] Researchers have increasingly explored data-driven methods such as ConvNets from model-free networksto inclusion of models in various architectures Here wediscuss research with pressure images data-driven methodsand measuring network uncertainty

A Pressure-Image-based Work

Prior pressure-image-based pose estimation work has fit2D kinematic models to pressure image features In a seriesof papers including [4] Harada et al used a kinematic modelto create a synthetic database for comparison with groundtruth pressure images Grimm et al [5] identified humanorientation and pose using a prior skeleton model Similarto our motivation and findings they used a pressure mat tocompensate for bedding occlusion and observed higher errorfor lighter joints (eg the elbow) which had a relatively lowpressure Liu et al [6] generated a pictorial structures modelto localize body parts on a flat bed

A few researchers have also looked at human postureclassification from pressure images [5] [22] [23] Postureclassification is a different problem from 3D body poseestimation but it can be used in a complementary way egby providing a prior on the model used for pose estimation

B Data-driven Human Pose Estimation

Like the pressure-image-based research we use a humanbody model but take a data-driven approach more commonin vision-based work While infrared and depth images haveseen recent attention in human pose estimation [9] a largebody of vision-based work uses monocular RGB cameras [10]Our method builds upon research with monocular RGB imageinput we note the following similarities and differences

bull Single image Both monocular RGB-image-based workand our pressure-image-based work has a single inputarray

bull Under-constrained For 3D human pose both monocularRGB images and single pressure images are under-constrained

bull Data content The data encoding is fundamentallydifferent For example in the context of pose estimationlight intensity in an RGB image is highly disconnectedfrom pressure intensity

bull Dimensionality RGB images typically have more fea-tures and a higher resolution than pressure images

bull Warped spatial representation Calibrating RGB camerasto alleviate distortion is straightforward In contrast itis challenging to determine the configuration of a clothpressure sensing mat from its pressure image in the caseof folds or bends

While the differences may limit the transferability ofmethods across data types we find that some data-drivenmethods previously used for vision modalities are applicableto pressure images Researchers have performed 3D humanpose estimation with monocular RGB images using standardmachine learning algorithms such as ridge regressors [11]ndash[13] In particular Okada and Soatto [11] use kernel ridgeregression (KRR) as well as linear ridge regression (LRR)Further Ionescu et al [12] compare K-Nearest Neighbors(KNN) and ridge regression on the Human36M datasetWe use these classical approaches to provide a baselinecomparison for our proposed method

Recently with the advent of high quality labeled syn-chronous datasets such as Human36M [12] many researchershave explored deep learning methods such as end-to-endtraining of ConvNets [14]ndash[17] Two common ConvNetapproaches include direct regression to joint labels [14] [21]and regression to discretized confidence maps [15] [16][18] Within 3D human pose estimation research confidencemap approaches include Pavlakos et al [16] who train aConvNet end-to-end on a 3D confidence voxel space andZhou Zhu et al [19] and Bogo et al [20] who fit a 3D modelto 2D confidence maps However the high dimensional outputspace of confidence maps can make real-time pose estimationdifficult Li and Chan [21] used rapid direct regression to 3DCartesian joint positions we implement a similar architecturebecause real-time estimation is important to our plannedapplication Zhou Sun et al [17] take a hybrid approach bytraining a ConvNet end-to-end and enforcing anthropomorphicconstraints with an embedded human skeleton kinematicsmodel with constant link lengths We implement a methodof this form for comparison and introduce a new architecturewith variable skeleton link lengths to allow the model toadapt to differently sized people

C Monte Carlo Dropout in ConvNets

We use Monte Carlo dropout to measure network uncer-tainty a method introduced by Gal and Ghahramani [8]Monte Carlo dropout has been applied to measure uncertaintyfor camera relocalization [24] and semantic segmentation[25] We use this method to estimate pose and provide ameasure of confidence

Fig 2 Two ConvNet architectures The Direct ConvNet directly regresses to ten motion capture labeled global joint positions The Kinematic ConvNetembeds a kinematics skeleton model into the last fully connected network layer parameterized in the latent space by a root joint global position jointangles and skeleton link lengths The Kinematic ConvNet also outputs unlabeled joint position estimates We explore architectures with both variable linklength (shown) and constant link length (dashed arrows defined by lowast)

III METHOD

Our ConvNets learn a function f(P θB) that estimatespose parameters of a person lying in a robotic bed given aspecified bed configuration θB and a 2D pressure image Pfrom a pressure mat

A ConvNet Architecture

We explore two ConvNet architectures shown in Fig 2Our network includes four 2D convolutional layers with 64output channels for the first two layers and 128 channels forthe last two layers The layers mostly have 3times 3 filters witha ReLU activation and a dropout of 10 applied after eachlayer We apply max pooling and the network ends with alinear fully connected layer

To estimate a personrsquos joint pose we construct an inputtensor for the ConvNet comprised of three channels ieP EB isin R128times54times3 Raw data from the pressure sensingmat is recorded as a (64times 27)-dimensional image which weupsample by a factor of two ie P isin R128times54 We use firstorder interpolation for upsampling In addition to a pressureimage we also provide the ConvNet with an edge detectionchannel E isin R128times54 which is computed as a Sobelfilter over both the horizontal and vertical directions of theupsampled image Empirically we found this edge detectioninput channel improved pose estimation performance In orderto estimate human pose at different configurations of the bed(eg sitting versus lying down) we compute a third inputchannel B isin R128times54 which depicts the bed configurationSpecifically each element in the matrix B depicts the verticalheight of the corresponding taxel on the pressure mat Whenthe bed frame is flat ie θB = 0 then B is simply the zeromatrix

B Direct Joint Regression

The first proposed ConvNet architecture outputs an estimateof the motion capture labeled global 3D joint positionss1 sN where each sj isin R3 represents a 3D positionestimate for joint j This direct ConvNet regresses directlyto 3D ground truth label positions in the last fully connectedlayer of the network We compute the loss from the absolute

Fig 3 Kinematics Model Parameters The root joint schest is specifiedin red Yellow boxes represent the remaining motion capture labeled jointss2 sN Grey boxes represent unlabeled joints su1 suK wherethe kinematic model adjusts to an approximate fit

value of Euclidean error on each joint

Lossdirect =

Nsumj=1

||sj minus sj || (1)

C Deep Kinematic Embedding

We embed a human skeleton kinematics model into thelast fully connected network layer to enforce geometricand anthropomorphic constraints This creates an extranetwork layer where labeled joint estimates s1 sN andunlabeled joint estimates su1 suK are solved throughforward kinematics equations depending on root joint positions1 latent space joint angles φ and an estimate of skeletonlink length approximations l We use s1 = schest as the rootjoint We incorporate skeleton link lengths l into the lossfunction by pre-computing an approximation to the groundtruth While the kinematics functions are relative to a rootjoint we learn root joint global position schest to put ouroutput in global space We compute a weighted loss fromthe absolute value of Euclidean error on each joint and erroron each link length

Losskin = ||s1 minus s1||+ α

Nsumj=2

||sj minus sj ||+ β|lminus l| (2)

where α and β are weighting factors We compare two variantsof this loss function The variable link length ConvNet is as

Fig 4 Range of paths traversed by participants during training Equivalentpaths were traversed by the left limbs Count represents both left and rightdata across 17 participants

described while for the constant link length ConvNet we setβ = 0 and use a constant llowast input to the kinematics modelWe compute llowast as the average of the approximations l

1) Human Kinematics Model We represent the humanbody with a model similar to that used in other work [12][17] [21] [26] with 17 joints to cover major links downto the wrists and ankles We ignore minor links and jointsTo train the networks that have link lengths as an outputwe require ground truths for comparison Some ground truthlink lengths may be calculated directly from the datasetrsquoslabels for example when motion capture gives the locationof both ends of the link We approximate the link lengths forlinks that are under-constrained in the dataset for the skeletonmodel The link lengths are an output of our network asa l isin R17 vector We ignore unlabeled joints in the lossfunction

The mid-spine is found by a vertical offset from thechest marker to compensate for the distance between markerplacement atop the chest and the modeled bending point ofthe spine We do not make offset corrections with other jointsthese are more challenging than the chest and the effects areless noticeable

2) Angular Latent Space We define 20 angular degreesof freedom consisting of 3-DOF shoulders and hips 1-DOF elbow and knee joints a 2-DOF spine joint and a2-DOF neck joint Fig 3 (b) shows this parameterizationcorresponding to labeled and unlabeled joints For the spineof the model to better match the spine of a person seatedin bed we used two revolute joints about the x-axis Toaccount for head movement we placed a neck joint at themidpoint of the shoulders with pitch and yaw rotation We usePyTorch [27] a deep learning library with tensor algebra andautomatic differentiation We manually encoded the forwardkinematics for the skeleton kinematic model The networkuses stochastic gradient descent during backpropagation tofind inverse kinematics (IK) solutions

Fig 5 (a) Crossover of the shoulder from the supine elbow 0 verticaltraversal both joints envisioned as 2-DOF inverted pendulum Green dotsrepresent left elbow and wrist ground truth markers projected in 2D yellowdot and arrows indicate approximate shoulder and limb positions (b) Modelof 2-DOF inverted pendulum showing static indeterminacy Known pressuredistribution P (x) is insufficient to solve θ1 θ2

D Pressure Image Ambiguity

Raising a limb off of a bed with pressure sensors can leadto a loss of information as the sensors can only sense pressureduring contact A similar loss of information can be seenwhen the limbs extend off the edge of the bed Considerthe movement shown in Fig 4 (c) and an example of thepressure images associated with such a movement in Fig 5(a) Here the pressure images appear nearly identical whilethe elbow and wrist positions change substantially To betterunderstand what is physically causing this phenomena wecan model the arm as a double inverted pendulum shown inFig 5 (b) Here the pendulum angles θ1 and θ2 are staticallyindeterminate given an underlying pressure distribution untilpart of the arm touches the sensors

Another challenge is pressure sensor resolution Dependingon the type of sensor saturation can occur A pressure imagewith higher spatial resolution accuracy and pressure rangemay result in less ambiguity We note that a model of bodyshape (eg 3D limb capsules) might provide information thathelps to resolve the ambiguity

E Uncertainty Monte Carlo Dropout

To estimate both joint position and model uncertaintysimultaneously we apply Monte Carlo dropout from Galand Ghahramani [8] Monte Carlo dropout is the processof performing V forward passes through the network withdropout enabled This results in V output vectors which maydiffer slightly due to the stochastic dropout of data duringeach forward pass We can compute an estimated output as theaverage of all V outputs corresponding to the first momentof the predictive distribution within the network Similarlythe modelrsquos uncertainty corresponds to the second momentof the distribution which we can compute as the variance ofall V forward passes

IV EVALUATION

We recorded a motion capture labeled dataset with over28000 pressure images from 17 different human participants1

We conducted this study with approval from the Georgia

1Dataset ftpftp-hrlbmegatechedupressure_mat_pose_data

Institute of Technology Institutional Review Board (IRB) andobtained informed consent from all participants We recruited11 male and 6 female participants aged 19-32 who ranged157-183 m in height and 45-94 kg in weight We fittedparticipants with motion capture markers at the wrists elbowsknees ankles head and chest We used a commerciallyavailable 64 times 27 pressure mat from Boditrak sampled at7 Hz We asked each participant to move their limbs in 6patterns 4 while supine and 2 while seated to represent somecommon poses in a configurable bed The movement pathsare shown in Fig 4 We instructed participants to keep theirtorso static during limb movements

We trained six data-driven models three baseline super-vised learning algorithms and the three proposed ConvNetarchitectures2 We designed the network using 7 participants(5M 2F) we performed leave-one-participant-out crossvalidation using the remaining 10 participants (6M 4F)

A Data Augmentation

At each training epoch for the ConvNets we selectedimages such that each participant would be equally repre-sented in both the training and test sets We augmented theoriginal dataset in the following ways to increase trainingdata diversity

bull Flipping Flipped across the longitudinal axis withprobability P = 05

bull Shifting Shifted by an additive factor sh sim N (micro =0cm σ = 286cm)

bull Scaling Scaled by a multiplicative factor sc sim N (micro =1 σ = 006)

bull Noise Added taxel-by-taxel noise to images by anadditive factor N (micro = 0 σ = 1) Clipped the noiseat min pressure (0) and at the saturated pressure value(100)

We chose these to improve the networkrsquos ability to generalizeto new people and positions of the person in bed We didnot shift the seated data longitudinally or scale it because ofthe warped spatial representation

B Baseline Comparisons

We implemented three baseline methods to compare ourConvNets against K-nearest neighbors (KNN) Linear RidgeRegression (LRR) and Kernel Ridge Regression (KRR) Forall baseline methods we used histogram of oriented gradients(HOG) features [28] on 2times upsampled pressure images Weapplied flipping shifting and noise augmentation methodsWe did not use scaling as it worsened performance

1) K-Nearest Neighbors We implemented a K-nearestneighbors (KNN) regression baseline using Euclidean distanceon the HOG features to select neighbors as [12] did Weselected k = 10 for improved performance

2) Ridge Regression We implemented two ridge-regression-based baselines Related work using these methodsare described in Section II We trained linear ridge regression(LRR) models with a regularization factor of α = 07 Wealso train Kernel Ridge Regression (KRR) models with a

2Code release httpsgithubcomgt-ros-pkghrl-assistivetreeindigo-develhrl_pose_estimation

radial basis function (RBF) kernel and α = 04 We manuallyselected these values of α for both LRR and KRR We alsotried linear and polynomial kernels for KRR but found theRBF kernel produced better results in our dataset

C Implementation Details of Proposed ConvNets

During testing we estimated the joint positions with V =25 forward passes on the trained network with Monte Carlodropout for each test image For each joint we report themean of the forward passes as the estimated joint positionWe use PyTorch and ADAM from [29] for gradient descent

1) Pre-trained ConvNet We created a pre-trained ConvNetthat we use to initialize both Kinematic ConvNets withregressed and constant link length The pre-trained networkused the kinematically embedded ConvNet and the lossfunction in Eq 2 with α = 05 and β = 05 This networkwas trained for 10 epochs on the 7 network-design participantswith a learning rate of 000002 and weight decay of 00005

2) Direct ConvNet We trained the network for 300 epochsdirectly on motion capture ground truth using the sum ofEuclidean error as the loss function We used a learning rateof 000002 and a weight decay of 00005

3) Kinematic ConvNet Constant Link Length We trainedthe network through the kinematically embedded ConvNetused the loss function in Eq 2 with α = 05 and β = 0This value for β means the network would not regress tolink length leaving it constant We initialized the networkwith the pre-trained ConvNet but we separately initializedeach link length as the average across all images in the foldrsquostraining set for each fold of cross validation

4) Kinematic ConvNet Regress Link Length We trainedthe network through the kinematically embedded ConvNetused the loss function in Eq 2 with α = 05 and β = 05Joint Cartesian positions and link lengths in the ground truthare represented on the same scale We initialized with thepre-trained ConvNet

D Measure of Uncertainty

Here we show an example where ambiguous pressure matdata has a high model uncertainty We compare two legabduction movements from the supine leg motion shownin the bottom two columns of Fig 4 (d) We sample 100images per participant half feature leg abduction contactingthe pressure mat and half with elevated leg abduction Foreach pose we use V = 25 stochastic forward passes andcompute the standard deviation of the Euclidean distancefrom the mean for abducting joints including knees andfeet We compare this metric between elevated and in-contactmotions

V RESULTS

In Table I we present the mean per joint positionerror (MPJPE) a metric from literature to represent overallaccuracy [12] [21] Fig 6 shows the per-joint position erroracross all trained models separated into supine and seatedpostures The error for the direct ConvNet and the kinematicConvNet with length regression is significantly lower than theother methods The results of knees and legs show that moredistal limbs on the kinematic chain do not necessarily result

Fig 6 Per-joint error mean and standard deviation for leave-one-participant-out cross validation over 10 participants for a sitting and a supine posture inbed Lower is better Our methods outperformed baseline methods

TABLE I Mean Per Joint Position ErrorMPJPE MPJPE MPJPESupine Seated Overall

Method (mm) (mm) (mm)

K-Nearest Neighbors 10201 9342 9906Linear Ridge Regression 11701 11476 11624Kernel Ridge Regression 9925 9710 9851

Direct ConvNet 7349 8244 7656Kinematic ConvNet avg l 9974 9970 9972Kinematic ConvNet regr l 7543 7919 7672

Fig 7 Comparison of the heaviest and tallest participant with the lightestand shortest participant Our kinematics ConvNet with link length regressionappears to adjust for both sizes

in higher error The wrists are both distal and light and havehigher error than the other joints Fig 7 shows the kinematicsConvNet with length regression adjusting for humans ofdifferent sizes and in different poses Furthermore we canperform a pose estimate with uncertainty using V = 25stochastic forward passes in less than a half second

A Measure of Uncertainty

We performed a t-test to compare uncertainty in elevatedleg abduction and in-contact leg abduction We comparedeach knee and ankle separately We found that the standard

deviation of the Euclidean distance from the mean of V = 25forward passes with Monte Carlo dropout is significantlyhigher for joints in the elevated position Further we notethat variance in the latent angle parameters θ compoundsthrough the kinematic model causing more distal joints in thekinematic chain to have higher uncertainty This phenomenais further described in Fig 8 which shows limbs removedfrom the mat that have a high variance

VI DISCUSSION AND LIMITATIONS

A Network Architecture Considerations

While the results for the direct ConvNet architecture weremarginally better than the kinematic ConvNet with variablelengths the latter has other advantages First there can bevalue in getting a skeletal model from the network providing amore complete set of parameters including 20 angular DOFsand a total of 17 joint positions Second unsurprisinglyrequiring that outputs from the ConvNet satisfy kinematicconstraints means that outputs will be constrained to plausiblelooking body poses Without those constraints the ConvNetcould produce unrealistic outputs

While limiting our skeleton model to 20 angular DOFspromotes simplicity it has some hindrance to generalizabilityFor example the mid-spine joint lacks rotational DOFs aboutthe y- and z-axes Adding these DOFs would allow themodel to account for rolling to a different posture and layingsideways in bed

B Data Augmentation Challenges

Data augmentation cannot easily account for a personsliding up and down in a bed that is not flat Vertical shiftingaugmentation for non-flat beds would not match the physicaleffects of shifting a person on the pressure mat Augmentationby scaling also has problematic implications because a muchsmaller or larger person may have a weight distribution thatwould not scale linearly at the bending point of the bed

Fig 8 Illustration of mean and standard deviation of kinematic ConvNet (l regression) output V = 25 forward passes with Monte Carlo dropout shownwith thin translucent skeleton links Spheres indicate joint position estimates (a) Right arm thrust into the air (b) Right forearm extending off the side ofthe pressure mat (c) Left leg extended in the air and abducted (d) Left knee extended in the air left foot contacting pressure mat Note (c) and (d) showthe same participant others are different

Simulation might resolve these issues by simulating placinga weighted human model of variable shape and size placedanywhere on a simulated pressure mat with many possibilitiesof bed configurations

C Dataset ConsiderationsThe posture and range of paths traversed by participants

may not be representative of other common poses and weexpect our method to have limited success in generalizing tobody poses not seen or rarely seen in the dataset While aparticipant is moving their arm across a specified path otherjoints remain nearly static which over-represents poses withthe arms adjacent to the chest and legs straight We foundover-represented poses to generally have a lower uncertaintyInterestingly with our current sampling and training strategyover-representation and under-representation is based on thepercentage of images in the dataset a joint or set of joints isin a particular configuration Additional epochs of training ordirectly scaling the size of the dataset does not change theseeffects on uncertainty of pose representation Our methodcould be improved by using weighting factors or samplingstrategies to compensate for this effect

In our evaluation some limb poses occur in separatetraining images but do not occur in the same training imageFor example we recorded one participant moving both armsand both legs simultaneously Fig 9 shows that our methodhas some ability to estimate these poses

The skeleton model has offset error in addition to theground truth error reported While we attempted to compen-sate for the chest marker offset the other markers were morechallenging This may have caused some inaccuracy in thelink length approximations

D Removal of High Variance JointsFig 10 shows that joints with high uncertainty have a higher

average error Discarding these joint estimates can decrease

Fig 9 Dual arm and dual leg traversals While these are excluded fromthe training set our methods can provide a reasonable pose estimate

Fig 10 Discarding joints with a higher uncertainty can decrease errorwith a tradeoff to the number of joints remaining

the average error of the model As an example application arobot using our methodrsquos estimated poses for task and motionplanning might want to require low uncertainty before planexecution

VII DEMONSTRATION WITH PR2 ROBOT

We conducted a demonstration of how our method couldinform an assistive robot trying to reach part a personrsquos bodyWe conducted this study with approval from the GeorgiaInstitute of Technology Institutional Review Board (IRB) andobtained informed consent from an participant We recruiteda single able-bodied participant who used a laptop computerrunning a web interface from [3] to command a PR2 robot

to move its end effector to their left knee and to their leftshoulder The robotrsquos goal was based on the estimated poseof the personrsquos body from our ConvNet with kinematicmodel regressing to link lengths For the knee position theparticipant raised her knee to the configuration shown in the2nd image of Fig 4 (f) and the participant was in the seatedposture for both tasks Using our 3D pose estimation methodthe robot was able to autonomously reach near both locationsFig 1 shows the robot reaching a shoulder goal while theparticipant is occluded by bedding and an over-bed table

VIII CONCLUSION

In this work we have shown that a pressure sensing matcan be used to estimate the 3D pose of a human in differentpostures of a configurable bed We explored two ConvNetarchitectures and found that both significantly outperformeddata-driven baseline algorithms Our kinematically embeddedConvNet with link length regression provided a more com-plete representation of a 17-joint skeleton model adheredto anthropomorphic constraints and was able to adjust toparticipants of varying anatomy We provided an examplewhere joints on limbs raised from the pressure mat had ahigher uncertainty than those in contact We demonstratedour work using a PR2 robot

ACKNOWLEDGMENT

We thank Wenhao Yu for his suggestions to this work This materialis based upon work supported by the National Science FoundationGraduate Research Fellowship Program under Grant No DGE-1148903 NSF award IIS-1514258 NSF award DGE-1545287 AWSCloud Credits for Research and the National Institute on DisabilityIndependent Living and Rehabilitation Research (NIDILRR) grant90RE5016-01-00 via RERC TechSAge Any opinion findings andconclusions or recommendations expressed in this material arethose of the author(s) and do not necessarily reflect the views ofthe National Science Foundation Dr Kemp is a cofounder a boardmember an equity holder and the CTO of Hello Robot Inc whichis developing products related to this research This research couldaffect his personal financial status The terms of this arrangementhave been reviewed and approved by Georgia Tech in accordancewith its conflict of interest policies

REFERENCES

[1] K Hawkins P Grice T Chen C-H King and C C Kemp ldquoAssistivemobile manipulation for self-care tasks around the headrdquo in 2014 IEEESymposium on Computational Intelligence in Robotic Rehabilitationand Assistive Technologies IEEE 2014

[2] A Kapusta Y Chitalia D Park and C C Kemp ldquoCollaborationbetween a robotic bed and a mobile manipulator may improve physicalassistance for people with disabilitiesrdquo in RO-MAN IEEE 2016

[3] P M Grice and C C Kemp ldquoAssistive mobile manipulation Designingfor operators with motor impairmentsrdquo in RSS 2016 Workshop onSocially and Physically Assistive Robotics for Humanity 2016

[4] T Harada T Sato and T Mori ldquoPressure distribution image basedhuman motion tracking system using skeleton and surface integrationmodelrdquo in ICRA vol 4 IEEE 2001 pp 3201ndash3207

[5] R Grimm S Bauer J Sukkau J Hornegger and G GreinerldquoMarkerless estimation of patient orientation posture and pose usingrange and pressure imagingrdquo International journal of computer assistedradiology and surgery vol 7 no 6 pp 921ndash929 2012

[6] J J Liu M-C Huang W Xu and M Sarrafzadeh ldquoBodypartlocalization for pressure ulcer preventionrdquo in EMBC IEEE 2014pp 766ndash769

[7] P M Grice Y Chitalia M Rich H M Clever and C C KempldquoAutobed Open hardware for accessible web-based control of an electricbedrdquo in RESNA 2016

[8] Y Gal and Z Ghahramanim ldquoDropout as a bayesian approximationRepresenting model uncertainty in deep learningrdquo in Conference onMachine Learning 2016 pp 1050 ndash 1059

[9] W Gong X Zhang J Gonzagravelez A Sobral T Bouwmans C Tuand E-h Zahzah ldquoHuman pose estimation from monocular imagesA comprehensive surveyrdquo Sensors vol 12 no 16 p 1966 2016

[10] N Sarafianos B Boteanu C Ionescu and I A Kakadiaris ldquo3d humanpose estimation A review of the literature and analysis of covariatesrdquoComputer Vision and Image Understanding no 152 pp 1ndash20 2016

[11] R Okada and S Soatto ldquoRelevant feature selection for human poseestimation and localization in cluttered imagesrdquo in ECCV Springer2008 pp 434ndash445

[12] C Ionescu D Papava V Olaru and C Sminchisescu ldquoHuman36mLarge scale datasets and predictive methods for 3d human sensingin natural environmentsrdquo Trans on Pattern Analysis and MachineIntelligence vol 36 no 7 pp 1325ndash1339 2014

[13] A Agarwal and B Triggs ldquoRecovering 3d human pose from monocularimagesrdquo Pattern Analysis and Machine Intelligence IEEE Transactionson vol 28 no 1 pp 44ndash58 2006

[14] A Toshev and C Szegedy ldquoDeeppose Human pose estimation viadeep neural networksrdquo in CVPR

[15] J J Tompson A Jain Y LeCun and C Bregler ldquoJoint trainingof a convolutional network and a graphical model for human poseestimationrdquo in Advances in neural information processing systems2014 pp 1799ndash1807

[16] G Pavlakos X Zhou K G Derpanis and K Daniilidis ldquoCoarse-to-fine volumetric prediction for single-image 3d human poserdquo in CVPRIEEE 2017

[17] X Zhou X Sun W Zhang S Liang and Y Wei ldquoDeep kinematicpose regressionrdquo in ECCV 2016 Workshops 2016 pp 186ndash201

[18] R V Wei Shih-En T Kanade and Y Sheikh ldquoConvolutional posemachinesrdquo in CVPR 2016 pp 4724ndash4732

[19] X Zhou M Zhu S Leonardos K G Derpanis and K DaniilidisldquoSparseness meets deepness 3d human pose estimation from monocularvideordquo in CVPR IEEE 2016 pp 4966ndash4975

[20] F Bogo A Kanazawa C Lassner P Gehler J Romero and M JBlack ldquoKeep it smpl Automatic estimation of 3d human pose andshape from a single imagerdquo in ECCV Springer 2016 pp 561ndash578

[21] S Li and A B Chan ldquo3d human pose estimation from monocularimages with deep convolutional neural networkrdquo in Asian Conferenceon Computer Vision Springer pp 332ndash347

[22] M Farshbaf R Yousefi M B Pouyan S Ostadabbas M Nouraniand M Pompeo ldquoDetecting high-risk regions for pressure ulcer riskassessmentrdquo in BIBM IEEE 2013 pp 255ndash260

[23] S Ostadabbas M B Pouyan M Nourani and N Kehtarnavaz ldquoIn-bed posture classification and limb identificationrdquo in BioCAS IEEE2014 pp 133ndash136

[24] A Kendall and R Cipolla ldquoModelling uncertainty in deep learningfor camera relocalizationrdquo in ICRA IEEE 2016 pp 4762ndash4769

[25] M Kampffmeyer A-B Salberg and R Jenssen ldquoSemantic segmen-tation of small objects and modeling of uncertainty in urban remotesensing images using deep convolutional neural networksrdquo in CVPRWIEEE 2016 pp 680ndash688

[26] D Mehta S Sridhar O Sotnychenko H Rhodin M Shafiei H-P Seidel W Xu D Casas and C Theobalt ldquoVnect Real-time 3dhuman pose estimation with a single rgb camerardquo ACM Transactionson Graphics vol 36 no 4 pp 441 ndash 4414 2017

[27] A Paszke S Gross S Chintala G Chanan E Yang Z DeVito Z LinA Desmaison L Antiga and A Lerer ldquoAutomatic differentiation inpytorchrdquo 2017

[28] N Dalal and B Triggs ldquoHistograms of oriented gradients for humandetectionrdquo in CVPR vol 1 IEEE 2005 pp 886ndash893

[29] D P Kingma and J Ba ldquoAdam A method for stochastic optimizationrdquo2014

I Introduction

II Related work

II-A Pressure-Image-based Work

II-B Data-driven Human Pose Estimation

II-C Monte Carlo Dropout in ConvNets

III Method

III-A ConvNet Architecture

III-B Direct Joint Regression

III-C Deep Kinematic Embedding

III-C1 Human Kinematics Model

III-C2 Angular Latent Space

III-D Pressure Image Ambiguity

III-E Uncertainty Monte Carlo Dropout

IV Evaluation

IV-A Data Augmentation

IV-B Baseline Comparisons

IV-B1 K-Nearest Neighbors

IV-B2 Ridge Regression

IV-C Implementation Details of Proposed ConvNets

IV-C1 Pre-trained ConvNet

IV-C2 Direct ConvNet

IV-C3 Kinematic ConvNet Constant Link Length

IV-C4 Kinematic ConvNet Regress Link Length

IV-D Measure of Uncertainty

V Results

V-A Measure of Uncertainty

VI Discussion and Limitations

VI-A Network Architecture Considerations

VI-B Data Augmentation Challenges

VI-C Dataset Considerations

VI-D Removal of High Variance Joints

VII Demonstration with PR2 Robot

VIII Conclusion

References

challenging issue for the pressure image modality In thiswork we consider common poses where this issue arisessuch as an arm raised in the air resembling a double invertedpendulum Among other poses this case demonstrates wherethe pressure image can be similar for different configurationsof the arm In such a case the pressure data may beinsufficient to confidently estimate the pose of the armAn estimate of confidence or model uncertainty can bevaluable for example to allow an assistive robot to reject low-confidence estimates by removing them from a list of potentialgoals in task plan execution To estimate model uncertaintywe use Monte Carlo dropout a method proposed by Gal andGhahmarani [8] With this method we perform a numberof stochastic forward passes through the ConvNet duringtest time and compute the joint position and joint positionconfidence from the moments of the output distribution

II RELATED WORK

Markerless human pose estimation is a challenging problemcomplicated by environment factors the human pose config-urations of interest and data type Relatively few researchershave used pressure images for human pose estimation in bed[4]ndash[6] while many used cameras in myriad environments andposes [9]ndash[21] Researchers have increasingly explored data-driven methods such as ConvNets from model-free networksto inclusion of models in various architectures Here wediscuss research with pressure images data-driven methodsand measuring network uncertainty

A Pressure-Image-based Work

Prior pressure-image-based pose estimation work has fit2D kinematic models to pressure image features In a seriesof papers including [4] Harada et al used a kinematic modelto create a synthetic database for comparison with groundtruth pressure images Grimm et al [5] identified humanorientation and pose using a prior skeleton model Similarto our motivation and findings they used a pressure mat tocompensate for bedding occlusion and observed higher errorfor lighter joints (eg the elbow) which had a relatively lowpressure Liu et al [6] generated a pictorial structures modelto localize body parts on a flat bed

A few researchers have also looked at human postureclassification from pressure images [5] [22] [23] Postureclassification is a different problem from 3D body poseestimation but it can be used in a complementary way egby providing a prior on the model used for pose estimation

B Data-driven Human Pose Estimation

Like the pressure-image-based research we use a humanbody model but take a data-driven approach more commonin vision-based work While infrared and depth images haveseen recent attention in human pose estimation [9] a largebody of vision-based work uses monocular RGB cameras [10]Our method builds upon research with monocular RGB imageinput we note the following similarities and differences

bull Single image Both monocular RGB-image-based workand our pressure-image-based work has a single inputarray

bull Under-constrained For 3D human pose both monocularRGB images and single pressure images are under-constrained

bull Data content The data encoding is fundamentallydifferent For example in the context of pose estimationlight intensity in an RGB image is highly disconnectedfrom pressure intensity

bull Dimensionality RGB images typically have more fea-tures and a higher resolution than pressure images

bull Warped spatial representation Calibrating RGB camerasto alleviate distortion is straightforward In contrast itis challenging to determine the configuration of a clothpressure sensing mat from its pressure image in the caseof folds or bends

While the differences may limit the transferability ofmethods across data types we find that some data-drivenmethods previously used for vision modalities are applicableto pressure images Researchers have performed 3D humanpose estimation with monocular RGB images using standardmachine learning algorithms such as ridge regressors [11]ndash[13] In particular Okada and Soatto [11] use kernel ridgeregression (KRR) as well as linear ridge regression (LRR)Further Ionescu et al [12] compare K-Nearest Neighbors(KNN) and ridge regression on the Human36M datasetWe use these classical approaches to provide a baselinecomparison for our proposed method

Recently with the advent of high quality labeled syn-chronous datasets such as Human36M [12] many researchershave explored deep learning methods such as end-to-endtraining of ConvNets [14]ndash[17] Two common ConvNetapproaches include direct regression to joint labels [14] [21]and regression to discretized confidence maps [15] [16][18] Within 3D human pose estimation research confidencemap approaches include Pavlakos et al [16] who train aConvNet end-to-end on a 3D confidence voxel space andZhou Zhu et al [19] and Bogo et al [20] who fit a 3D modelto 2D confidence maps However the high dimensional outputspace of confidence maps can make real-time pose estimationdifficult Li and Chan [21] used rapid direct regression to 3DCartesian joint positions we implement a similar architecturebecause real-time estimation is important to our plannedapplication Zhou Sun et al [17] take a hybrid approach bytraining a ConvNet end-to-end and enforcing anthropomorphicconstraints with an embedded human skeleton kinematicsmodel with constant link lengths We implement a methodof this form for comparison and introduce a new architecturewith variable skeleton link lengths to allow the model toadapt to differently sized people

C Monte Carlo Dropout in ConvNets

We use Monte Carlo dropout to measure network uncer-tainty a method introduced by Gal and Ghahramani [8]Monte Carlo dropout has been applied to measure uncertaintyfor camera relocalization [24] and semantic segmentation[25] We use this method to estimate pose and provide ameasure of confidence


III METHOD









Lossdirect =

Nsumj=1





Nsumj=2














IV EVALUATION






A Data Augmentation





















V RESULTS
































VIII CONCLUSION


ACKNOWLEDGMENT


REFERENCES






























I Introduction

II Related work




III Method








IV Evaluation











V Results








VIII Conclusion

References


III METHOD









Lossdirect =

Nsumj=1





Nsumj=2














IV EVALUATION






A Data Augmentation





















V RESULTS
































VIII CONCLUSION


ACKNOWLEDGMENT


REFERENCES






























I Introduction

II Related work




III Method








IV Evaluation











V Results








VIII Conclusion

References












IV EVALUATION






A Data Augmentation





















V RESULTS
































VIII CONCLUSION


ACKNOWLEDGMENT


REFERENCES






























I Introduction

II Related work




III Method








IV Evaluation











V Results








VIII Conclusion

References



A Data Augmentation





















V RESULTS
































VIII CONCLUSION


ACKNOWLEDGMENT


REFERENCES






























I Introduction

II Related work




III Method








IV Evaluation











V Results








VIII Conclusion

References































VIII CONCLUSION


ACKNOWLEDGMENT


REFERENCES






























I Introduction

II Related work




III Method








IV Evaluation











V Results








VIII Conclusion

References

Henry M. Clever, Ariel Kapusta, Daehyung Park, Zackory ... · Henry M. Clever, Ariel Kapusta,...

Documents

Transcript of Henry M. Clever, Ariel Kapusta, Daehyung Park, Zackory ... · Henry M. Clever, Ariel Kapusta,...

Henry M. Clever*, Ariel Kapusta, Daehyung Park, Zackory ... · Henry M. Clever*, Ariel Kapusta,...

Documents

Transcript of Henry M. Clever*, Ariel Kapusta, Daehyung Park, Zackory ... · Henry M. Clever*, Ariel Kapusta,...

Henry M. Clever, Ariel Kapusta, Daehyung Park, Zackory ... · Henry M. Clever, Ariel Kapusta,...

Transcript of Henry M. Clever, Ariel Kapusta, Daehyung Park, Zackory ... · Henry M. Clever, Ariel Kapusta,...