
PedX: Benchmark Dataset for Metric 3-D Pose Estimation of Pedestrians in Complex Urban Intersections

Wonhui Kim, Manikandasriram Srinivasan Ramanagopal, Charles Barto, Ming-Yuan Yu, Karl Rosaen, Nick Goumas, Ram Vasudevan, and Matthew Johnson-Roberson

Abstract—This letter presents a novel dataset titled PedX, a large-scale multimodal collection of pedestrians at complex urban intersections. PedX consists of more than 5,000 pairs of high-resolution (12 MP) stereo images and LiDAR data along with providing two-dimensional (2-D) image labels and 3-D labels of pedestrians in a global coordinate frame. Data were captured at three four-way stop intersections with heavy pedestrian-vehicle interaction. We also present a 3-D model fitting algorithm for automatic labeling harnessing constraints across different modalities and novel shape and temporal priors. All annotated 3-D pedestrians are localized into the real-world metric space, and the generated 3-D models are validated using a motion capture system configured in a controlled outdoor environment to simulate pedestrians in urban intersections. We also show that the manual 2-D image labels can be replaced by state-of-the-art automated labeling approaches, thereby facilitating automatic generation of large scale datasets.

Index Terms—Computer vision for transportation, human detection and tracking.

I. INTRODUCTION

DRIVING in complex urban environments is one of the major challenges for autonomous vehicles (AVs). For AVs to operate in an environment crowded with people, understanding pedestrian pose, motion, behavior, and intention will greatly increase our ability to function safely and efficiently.

In computer vision, estimating human pose has been a long-standing problem.

Manuscript received September 10, 2018; accepted January 10, 2019. Date of publication January 31, 2019; date of current version February 28, 2019. This letter was recommended for publication by Associate Editor J. McDonald and Editor C. Cadena Lerma upon evaluation of the reviewers' comments. This work was supported by a grant from Ford Motor Company via the Ford-UM Alliance under award N022884. (Corresponding author: Wonhui Kim.)

W. Kim, M. Srinivasan Ramanagopal, C. Barto, M.-Y. Yu, K. Rosaen, and N. Goumas are with the UM-Ford Center for Autonomous Vehicles Laboratory, University of Michigan, Ann Arbor, MI 48105 USA (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).

R. Vasudevan is with the Faculty of Mechanical Engineering, University of Michigan, Ann Arbor, MI 48105 USA (e-mail: [email protected]).

M. Johnson-Roberson is with the Faculty of Naval Architecture and Marine Engineering, Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48105 USA (e-mail: [email protected]).

Digital Object Identifier 10.1109/LRA.2019.2896705

The recent application of deep neural networks has generated state-of-the-art results for 2D body pose estimation [1], which has inspired extensions to 3D pose estimation [2]–[6]. However, gathering ground truth 3D pose data is challenging. Motion capture (mocap) systems have been the primary generator of ground truth 3D pose data, but have restricted the variety and complexity of the 3D scenes that can be captured [7], [8]. For example, with mocap systems it is difficult to capture naturalistic in-the-wild scenes with groups of people who are moving and interacting. To overcome those technical limitations, this letter develops both a dataset and a ground truth generation approach to facilitate generating 3D poses on in-the-wild images without relying on mocap.

Most AVs have cameras installed, so this data can serve as a primary source for human pose estimation using computer vision algorithms. In addition to cameras, LiDAR (Light Detection and Ranging) has become an essential component for AVs due to its precise depth measurements. This motivated capturing both modalities for this benchmark set of complex urban intersections.

Our dataset has the following unique properties. First, the data are gathered outdoors with real challenges such as varied lighting and weather conditions and the presence of occlusions. Second, the pedestrian data are collected at intersection length scales of up to 45 m range, which is relevant for deployment of pose estimation systems at application-relevant scales. The captured scenes are also naturalistic. The pedestrians in our dataset are not actors, so they move and interact in a myriad of realistic ways. In addition, our dataset includes multi-person images capturing crowds of people. Lastly, we release multimodal pedestrian data including high resolution images and point clouds that are synchronously captured from stereo cameras and LiDAR sensors.

Annotations of our dataset also have distinctive features. All the annotated 3D pedestrians lie in a global metric-space coordinate frame, as opposed to many existing datasets that operate in hip-joint or camera-center relative coordinate frames. We stress the importance of determining where a person is in the 3D world so one can plan actions around them. In addition, our multimodal data frames are captured in minutes-long sequences with unique tracking IDs for each pedestrian, enabling temporal reasoning.

The contributions of this letter are summarized as follows:

1. We release a publicly available large-scale multimodal pedestrian dataset with a rich set of 2D and 3D annotations. The dataset captures the real world challenges that AVs are likely to face at urban intersections.


Fig. 1. An example image from the PedX dataset with bounding boxes around the pedestrians (left) and a rendered image with the 3D human mesh models in metric space (right).

2. We present an automatic method to obtain full 3D labels from 2D data, enabling labeling of in-the-wild images without mocap. Using state-of-the-art algorithms for 2D annotations, our proposed approach allows generating 3D data in a completely unsupervised manner.

3. Our automatic 3D labeling method is validated using a mocap system in a controlled outdoor environment that simulates pedestrians in urban intersections.

We present this dataset to enable the study of 3D pose estimation while reasoning about pedestrian behavior around vehicles, particularly in crowded urban areas as depicted in Fig. 1. We see this as one of the first areas where human pose estimation can have a tremendous impact on safety and intelligence of mobile robot systems. Understanding the pose of road users affords information about activity, attention, and predictions of future position, which are critical to safely navigate around humans.

II. RELATED WORK

A. 3D Human Pose Estimation

In many letters, 3D human pose estimation has been formulated as a problem of regressing 3D joint locations by directly extracting visual features from an image [2]–[5], [9]–[12], or by lifting 2D joint detector outputs to 3D joints in a camera relative frame [6], [13], [14]. To reduce the inherent ambiguity of 3D human pose estimation from a 2D image, prior knowledge about feasible human poses has been considered [13], [15], or a deformable 3D human model such as Skinned Multi-Person Linear (SMPL) [16] has been fit to known 2D joint locations [14].

More recently, 3D poses have been estimated as part of richer and denser mesh representations. SMPL model parameters are estimated from given dense 2D annotations [17], or the parameters are directly predicted from a single RGB image in an end-to-end manner using an adversarial learning framework [18]. Guler et al. [19] use a variant of the SMPL parameterization by transforming it into part-based UV coordinates that specify a bijective mapping between the mesh surface and an image. The proposed DensePose network assigns a body part to each pixel and regresses the corresponding surface coordinates. As a model-free approach, Rematas et al. [20] reconstruct 3D point clouds by estimating a per-pixel depth map for each soccer player from a monocular video using a neural network trained on synthetic data.

Most approaches take as input an image with a single person or a cropped image patch centered around a person, and return as output a 3D pose in a root-relative coordinate frame where the camera is facing toward the person. While the joint estimation of 3D pose and virtual camera parameters has been proposed [5], [13], the outputs still suffer from scale ambiguity. Without knowledge of exactly how far away a person is, controlling a mobile robot safely will be challenging. Most approaches are also unable to handle multi-person images with a single pass of an algorithm. While the multi-person 3D pose estimation problem has recently been addressed [19], [21], the output 3D poses are root-relative and still only reliable up to a scale.

A major impediment to predicting metric space pose for multiple people in a scene is the lack of a suitable dataset with reliable 3D annotations. Addressing this limitation is the focus of this letter.

B. 3D Human Pose Datasets

Large scale datasets have played an essential role in fueling recent progress in a variety of computer vision tasks. However, building a ground truth 3D human pose dataset in metric space is challenging since annotation in 3D is far more time consuming than the same task in 2D. To avoid manual labeling in 3D, mocap systems are typically used to obtain the ground truth 3D human pose [7], [8], [22]–[24].

The data captured with traditional mocap systems has many limitations. For instance, markers must be attached to subjects, which makes images look unnatural. Moreover, since mocap is typically restricted to constrained, mostly indoor areas, image backgrounds for mocap datasets have limited variability. The number of markers that can be tracked by mocap systems is also limited, which limits the number of subjects. The recent development of commercial markerless mocap systems [25] allowed data capture with natural-looking subjects [24], but these systems still suffer from the limited number of subjects and the size of the capture volume. Additional sensors such as IMUs [26], [27] have been used to obtain ground truth 3D human poses of in-the-wild images, but the number of subjects per image is small and the data is limited to humans.

To counter the limited variability in subject appearance, camera viewpoints, lighting conditions, and image backgrounds, synthetic 3D datasets have been proposed [28], [29]. While there have been attempts to improve the photo-realism of these synthetic data generation pipelines [24], [30], [31], state-of-the-art synthetic images are still easy to distinguish from real images. Others have explored techniques to automatically do 3D labeling with limited human intervention [17], [32], [33]. For instance, some have taken advantage of a multiview camera setup to obtain reliable 3D annotations using optimization [32]. Others explore fitting parameterized mesh models [14] to monocular images or multiview images to collect the ground truth 3D labels [17], [33]. However, these methods provide 3D scale models for a virtual camera relative frame and typically for images from various 2D pose datasets cropped around a single detection. This limits the potential utility of these methods while performing mobile robotic tasks safely. In contrast, we construct 3D human pose models in metric space for crowded urban street intersections that include as many as 15 pedestrians in a single image at distances as far as 45 m from the camera.


Fig. 2. Visualization of our annotations. For each pedestrian, we provide 2D segmentation, 2D joint locations with visibility of 18 body joints, tracking ID, time-synced LiDAR points, and a 3D mesh model localized into the global coordinate frame.

TABLE I
STATISTICS AND CHARACTERISTICS OF RELATED DATASETS


III. PEDX DATASET

The PedX dataset contains more than 5,000 pairs of high-resolution stereo images with 2,500 frames of 3D LiDAR point clouds. The cameras and LiDAR sensors are calibrated and time synchronized. We selected three four-way stop intersections with heavy pedestrian-vehicle interaction. Cameras are installed on the roof of the car to obtain driver-perspective images. To cover all four crosswalks at an intersection, the images were captured by two pairs of stereo cameras: one pair facing forward and another facing the incoming road from the left. Our dataset includes more than 14,000 pedestrian models at a distance of 5–45 m from the cameras, and we provide reliable 2D and 3D labels for each instance. Table I presents statistics of our dataset in comparison to other publicly available 3D human pose datasets. Further details about data acquisition and our dataset are available online [34].

Instance-level 2D segmentations and body joint locations are manually labeled for all images by instructed annotators. We also provide a unique tracking ID for each instance across the frames if the pedestrian appears in consecutive frames. A 2D segmentation is labeled by outlining a single connected polygon to cover the entire visible area of an object. To label the keypoints, 18 body joints are selected, including 4 facial components. In cases where some keypoints are invisible due to occlusion by other objects in the scene or self-occlusion, annotators made a reasonable guess about the joint location, but also indicated the degree of occlusion.
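To make the label format concrete, the following sketch shows one plausible way to load such per-instance 2D labels; the field names and JSON layout here are illustrative assumptions, not the actual PedX release schema.

```python
import json
from pathlib import Path

def load_2d_annotations(label_file):
    """Load per-instance 2D labels: segmentation polygon, 18 joints, tracking ID.

    The schema used here (keys 'track_id', 'polygon', 'joints') is hypothetical;
    consult the PedX release for the actual file format.
    """
    records = json.loads(Path(label_file).read_text())
    instances = []
    for rec in records:
        instances.append({
            "track_id": rec["track_id"],   # unique ID shared across consecutive frames
            "polygon": rec["polygon"],     # [[x, y], ...] single connected outline
            # 18 joints, each (x, y, occlusion flag); occluded joints are a
            # best guess by the annotator with the degree of occlusion noted.
            "joints": [(j["x"], j["y"], j["occ"]) for j in rec["joints"]],
        })
    return instances
```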

We use the SMPL [16] parameterization to represent our 3D annotations, which consists of shape and pose parameters, and additionally estimate the global location and orientation of the instance in the global coordinate frame.

Fig. 3. Distributions of (a) camera-to-pedestrian distance and (b) body orientation relative to the world coordinate frame.

The best fit 3D SMPL models were computed using the 2D annotations from a pair of stereo images and LiDAR points. The obtained 3D models encode pose, shape, and global location in 3D metric space without scale ambiguities. Fig. 2 illustrates 2D and 3D annotations from our dataset. The automatic algorithm to determine the optimal SMPL parameters is discussed in the next section.
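As a rough illustration of how a stored 3D label (shape, pose, global translation) can be turned back into a metric-space mesh, the sketch below uses the third-party smplx package; the parameter shapes and the gender-neutral model path are assumptions for illustration and are not part of the dataset release.

```python
import torch
import smplx  # third-party SMPL implementation (pip install smplx)

def label_to_mesh(betas, pose, transl, model_dir="models"):
    """Recover a metric-space mesh from a (shape, pose, translation) 3D label.

    betas:  (10,) shape coefficients; pose: (72,) axis-angle joint rotations
    (first 3 = global orientation); transl: (3,) global translation in meters.
    These shapes and the model directory are assumptions for this sketch.
    """
    model = smplx.create(model_dir, model_type="smpl", gender="neutral")
    out = model(
        betas=torch.tensor(betas, dtype=torch.float32).unsqueeze(0),
        global_orient=torch.tensor(pose[:3], dtype=torch.float32).unsqueeze(0),
        body_pose=torch.tensor(pose[3:], dtype=torch.float32).unsqueeze(0),
        transl=torch.tensor(transl, dtype=torch.float32).unsqueeze(0),
    )
    # Vertices and joints come back in the same global frame as the label.
    return out.vertices[0], out.joints[0]
```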

IV. MULTI-MODAL 3D MODEL FITTING

We perform 3D model fitting on pedestrians in a stereo-LiDAR sequence. In contrast to the previous work [14] that fits a 3D model to a single frame at a time, our approach optimizes over a sequence of stereo images and LiDAR points. We begin by per-instance model fitting, which is then extended to optimize over a sequence of instances using an iterative method. We also propose multi-modal and temporal priors. Note that we use a gender-neutral model for model fitting. Before initiating the model fitting pipeline, we preprocess the LiDAR data to identify regions containing potential pedestrians in 3D space. The points that are projected outside the images or that belong to static objects are removed. Using 2D segmentation labels for stereo images with known transformations between LiDAR and camera coordinate frames, we perform the point cloud labeling of each pedestrian instance.
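A minimal sketch of this point cloud labeling step is shown below, assuming a known LiDAR-to-camera extrinsic and rectified camera intrinsics; the function and variable names are ours.

```python
import numpy as np

def label_pointcloud(points_lidar, T_cam_lidar, K, seg_mask, image_shape):
    """Assign LiDAR points to a pedestrian instance via its 2D segmentation mask.

    points_lidar: (N, 3) points in the LiDAR frame.
    T_cam_lidar:  (4, 4) rigid transform from the LiDAR to the camera frame.
    K:            (3, 3) intrinsics of the rectified image.
    seg_mask:     (H, W) boolean mask of one pedestrian instance.
    Returns the subset of points whose projection falls inside the mask.
    """
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]
    in_front = pts_cam[:, 2] > 0                       # keep points in front of the camera
    uv = (K @ pts_cam[in_front].T).T
    uv = uv[:, :2] / uv[:, 2:3]
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    h, w = image_shape
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)    # discard points projecting outside
    hits = np.zeros(len(uv), dtype=bool)
    hits[valid] = seg_mask[v[valid], u[valid]]
    return points_lidar[in_front][hits]
```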

A. Fitting to a Single Instance (at a Single Time Step)

We begin by performing 3D model fitting to a single instance at a single time step. For each pedestrian instance, we are given 2D joint locations $\mathbf{x}_l$ and $\mathbf{x}_r$, and 2D segmentations $S_l$ and $S_r$ for each stereo image. We also have sparse 3D points corresponding to the instance. To find the pose $\boldsymbol{\theta}$, shape $\boldsymbol{\beta}$, and 3D global position $\mathbf{t}$ that best fit the instance, we formulate the problem


Fig. 4. PedX labeling pipeline.

as:

$$\min_{\boldsymbol{\theta},\,\boldsymbol{\beta},\,\mathbf{t}} \; E_I(\boldsymbol{\theta}, \boldsymbol{\beta}, \mathbf{t}) \tag{1}$$

where $E_I = E_J + E_{3d} + E_P + E_T + E_D$ represents the sum of multiple energy terms. We verify the effectiveness of each energy term through ablative experiments described in Sec. V-B. $E_J$ is the sum of robust 2D reprojection errors [14] for both left and right images, $E_P$ is the prior term, $E_{3d}$ is the 3D Euclidean distance term between visible SMPL vertices and the LiDAR points, $E_T$ is the translation term to constrain the 3D model location, and $E_D$ is the heading direction term to constrain the body orientation:

$$E_T(\mathbf{t}) = \|\mathbf{t} - \mathbf{t}_0\|_2^2, \tag{2}$$

$$E_D(\boldsymbol{\theta}) = \|f(\boldsymbol{\theta}) - \mathbf{d}\|_2^2, \tag{3}$$

$$E_{3d} = \frac{1}{N_v} \sum_{i}^{N_v} \min_{j} \|\mathbf{X}_i - \mathbf{V}_j\|_2^2, \tag{4}$$

where $\mathbf{t}_0$ is the mean of the 3D points, $f$ is a function that converts the axis-angle representation of the body orientation to an xyz directional vector, $\mathbf{d}$ is a known heading direction vector, $\mathbf{X}_i$ is the $i$-th LiDAR point, $N_v$ is the total number of 3D points that belong to the instance, and $\mathbf{V}_j$ is the $j$-th point of the SMPL model vertices.
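For concreteness, the data terms (2)–(4) could be written as in the following numpy sketch; the reprojection term $E_J$ and the pose/shape prior $E_P$ follow [14] and are omitted here, and in practice these terms would be evaluated inside a differentiable optimizer rather than standalone.

```python
import numpy as np

def energy_translation(t, lidar_points):
    """E_T: keep the model translation near the centroid of the instance's LiDAR points."""
    t0 = lidar_points.mean(axis=0)
    return np.sum((t - t0) ** 2)

def energy_heading(root_orient, heading_dir, axis_angle_to_dir):
    """E_D: penalize deviation of the body facing direction from the LiDAR heading.

    axis_angle_to_dir plays the role of f in (3), mapping the root axis-angle
    rotation to a unit direction vector; its exact form is an implementation choice.
    """
    return np.sum((axis_angle_to_dir(root_orient) - heading_dir) ** 2)

def energy_3d(lidar_points, visible_vertices):
    """E_3d: mean squared distance from each LiDAR point to its nearest visible SMPL vertex."""
    d2 = ((lidar_points[:, None, :] - visible_vertices[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()
```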

B. Fitting to a Sequence of Single Instances

To fit 3D models to a sequence of detections, we develop shape and temporal consistency constraints across frames in addition to the per-frame constraints.

1) Global Shape Consistency: Suppose one pedestrian appears in N consecutive frames with full 2D labels. In this instance, while pose parameters and translations change, the shape parameters should remain unchanged across the sequence. To find the pose, shape parameters, and translations, we formulate the problem as:

$$\min_{\boldsymbol{\Theta},\,\boldsymbol{\beta},\,\mathbf{T}} \; E_{seq}(\boldsymbol{\Theta}, \boldsymbol{\beta}, \mathbf{T}) \tag{5}$$

where $\boldsymbol{\Theta} = \{\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_N\}$ and $\mathbf{T} = \{\mathbf{t}_1, \ldots, \mathbf{t}_N\}$ are the set of pose parameters and the set of translations for all N frames. $\boldsymbol{\beta}$ is the shape parameter vector shared by all frames, where $\boldsymbol{\beta} = \boldsymbol{\beta}_1 = \cdots = \boldsymbol{\beta}_N$. Optimizing over the entire sequence is challenging due to the high dimension of the decision variables. Since the objective $E_{seq}$ is separable in terms of a per-frame objective, we decompose this large problem into a set of smaller problems.

We rewrite the unconstrained minimization problem over the entire sequence by introducing the consensus variable $\boldsymbol{\beta}$:

$$\begin{aligned} \min_{\boldsymbol{\theta}_{1:N},\,\boldsymbol{\beta}_{1:N},\,\mathbf{t}_{1:N},\,\boldsymbol{\beta}} \quad & \sum_{k=1}^{N} E_{I,k}(\boldsymbol{\theta}_k, \boldsymbol{\beta}_k, \mathbf{t}_k) \\ \text{subject to} \quad & \boldsymbol{\beta}_k - \boldsymbol{\beta} = 0, \quad k \in \{1, \ldots, N\} \end{aligned} \tag{6}$$

where k denotes the frame, ranging from 1 to N frames in a sequence. This optimization is a constrained minimization problem with a separable objective function and multiple constraints that require each per-frame shape parameter $\boldsymbol{\beta}_k$ to be equal. The advantage of introducing a consensus variable $\boldsymbol{\beta}$ is that we can enforce all the frames to have common shape parameters while exploiting parallelism. We solve the problem by using the alternating direction method of multipliers (ADMM) [35]. The augmented Lagrangian to be minimized is:

$$L_\rho(\boldsymbol{\beta}, \boldsymbol{\beta}_k; \boldsymbol{\Theta}, \mathbf{T}) = \sum_{k=1}^{N} \left[ E_{I,k}(\boldsymbol{\theta}_k, \boldsymbol{\beta}_k, \mathbf{t}_k) + \mathbf{y}_k^T (\boldsymbol{\beta}_k - \boldsymbol{\beta}) + \frac{\rho}{2} \|\boldsymbol{\beta}_k - \boldsymbol{\beta}\|_2^2 \right] \tag{7}$$

$\mathbf{y}_k$ is the dual variable for $\boldsymbol{\beta}_k$ and $\rho$ is a positive constant that is experimentally selected. $\rho = 2$ was used for the results reported in this letter. The objective $L_\rho$ is optimized using an alternating optimization for the local and global shape parameters with scaled dual variables $\{\mathbf{u}_k\}_{k=1}^{N}$, where $\mathbf{u}_k = \mathbf{y}_k / \rho$. The update equations at each iteration are as follows:

$$\boldsymbol{\beta}_k^{t+1} := \arg\min_{\boldsymbol{\beta}_k} \; E_{I,k}(\boldsymbol{\theta}_k, \boldsymbol{\beta}_k, \mathbf{t}_k) + \frac{\rho}{2} \|\boldsymbol{\beta}_k - \boldsymbol{\beta}^t + \mathbf{u}_k^t\|_2^2 \tag{8}$$

$$\boldsymbol{\beta}^{t+1} := \frac{1}{N} \sum_{k=1}^{N} \left( \boldsymbol{\beta}_k^{t+1} + \mathbf{u}_k^t \right) \tag{9}$$

$$\mathbf{u}_k^{t+1} := \mathbf{u}_k^t + \boldsymbol{\beta}_k^{t+1} - \boldsymbol{\beta}^{t+1} \tag{10}$$

We perform the synchronous update for the global shape parameters. The iteration is stopped when $\|\boldsymbol{\beta}_k^{t+1} - \boldsymbol{\beta}^{t+1}\|_2 < 0.05$ and $\rho \|\boldsymbol{\beta}^t - \boldsymbol{\beta}^{t+1}\|_2 < 0.05$, or the maximum number of iterations is reached. (8) is similar to the per-frame minimization in (1) with an additional term in the objective function.
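The consensus updates (8)–(10) can be organized as in the sketch below, where fit_single_frame is a placeholder for the per-frame minimization (8); this is a schematic of the ADMM loop under our reading of the equations, not released code.

```python
import numpy as np

def fit_sequence_admm(frames, fit_single_frame, n_betas=10, rho=2.0,
                      tol=0.05, max_iter=20):
    """Enforce a shared shape across a sequence via consensus ADMM.

    frames:           list of per-frame observations (2D joints, LiDAR points, ...).
    fit_single_frame: callable(frame, beta_target, rho) -> (theta_k, beta_k, t_k),
                      i.e. the per-frame problem (8); the name is a placeholder.
    """
    N = len(frames)
    beta_global = np.zeros(n_betas)            # consensus shape
    u = np.zeros((N, n_betas))                 # scaled dual variables u_k = y_k / rho
    betas = np.zeros((N, n_betas))
    thetas, trans = [None] * N, [None] * N

    for _ in range(max_iter):
        for k, frame in enumerate(frames):     # local updates (8); parallelizable
            thetas[k], betas[k], trans[k] = fit_single_frame(
                frame, beta_target=beta_global - u[k], rho=rho)
        beta_new = (betas + u).mean(axis=0)    # global consensus update (9)
        u += betas - beta_new                  # dual update (10)
        primal = np.linalg.norm(betas - beta_new, axis=1).max()
        dual = rho * np.linalg.norm(beta_global - beta_new)
        beta_global = beta_new
        if primal < tol and dual < tol:        # stopping criterion from the text
            break
    return thetas, beta_global, trans
```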

2) Temporal Pose Prior: In addition to enforcing per-frame shape parameters to share common values across the sequential frames, we propose a temporal pose prior that penalizes unlikely sequences of poses. A 72-dimensional pose vector consists of xyz rotation angles of 23 joints relative to each of their parent nodes, plus the orientation of the root hip in 3D axis-angle representation.


Fig. 5. (a) Overhead view of the point cloud trajectories of pedestrians in the global frame. (b) Initialized body orientation based on heading direction, and the final 3D models after the iterations.

We observe that the difference between the pose vectors from two consecutive frames is small and has some patterns common to individual joints. Especially for pedestrians at an intersection, many of them are walking or performing other actions at slow speed. Since the difference in rotation angles can also be affected by the translation velocity of pedestrians, we define a 75-dimensional vector whose first 3 components consist of the difference in body translation: $\Delta\mathbf{x} = (\Delta\mathbf{t}, \Delta\boldsymbol{\theta})$.

We fit a Gaussian mixture model with 10 distinct components to the pose difference vectors using the CMU mocap dataset [23]. Since the frame rate of the mocap and our data capture is different, we use the mocap frame rate when estimating this pose difference vector. We include the negative log of this mixture probability distribution as part of the prior in the objective term:

$$\Delta\mathbf{x}_t = \mathbf{x}_t - \mathbf{x}_{t-1} \sim \sum_{i}^{N} w_i\, \mathcal{N}(\Delta\mathbf{x}_t; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) \tag{11}$$

$$E_{tp}(\mathbf{t}_k, \boldsymbol{\theta}_k; \mathbf{t}_{k-1}, \boldsymbol{\theta}_{k-1}) = -\log \sum_{i}^{N} w_i\, \mathcal{N}(\Delta\mathbf{x}_t; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) \tag{12}$$

where $\boldsymbol{\mu}_i$, $\boldsymbol{\Sigma}_i$ are the mean and covariance of the pose difference vector $\Delta\mathbf{x}$ and $w_i$ is the weight of the $i$-th Gaussian mixture component.
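A possible offline construction of this prior is sketched below with scikit-learn; resampling the mocap sequences to the capture frame rate is omitted, and the array shapes are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_temporal_prior(pose_sequences, n_components=10):
    """Fit a GMM to 75-D frame-to-frame difference vectors (delta translation, delta pose)."""
    deltas = []
    for seq in pose_sequences:                 # seq: (T, 75) = [t (3), theta (72)] per frame
        deltas.append(np.diff(seq, axis=0))
    gmm = GaussianMixture(n_components=n_components, covariance_type="full")
    gmm.fit(np.vstack(deltas))
    return gmm

def temporal_prior_energy(gmm, x_prev, x_curr):
    """E_tp in (12): negative log-likelihood of the observed pose/translation change."""
    dx = (x_curr - x_prev).reshape(1, -1)
    return -gmm.score_samples(dx)[0]           # score_samples returns log p(dx)
```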

C. Initialization

One of the challenges while fitting a 3D model is estimating a body orientation with the incorrect sign [33]. To avoid obtaining a flipped model, we compute the heading direction of each pedestrian from a sequence of LiDAR points and use it to initialize the body orientation via a 3-dimensional axis-angle representation. We assume that pedestrians never move backwards, which held true in our data capture.
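A minimal sketch of this initialization is given below, assuming per-frame centroids of the tracked pedestrian's LiDAR points and a template pose that faces the +x axis; both assumptions are ours.

```python
import numpy as np

def init_global_orient(centroids, frame_idx, up=np.array([0.0, 0.0, 1.0])):
    """Initialize the root orientation from the walking direction of LiDAR centroids.

    centroids: (T, 3) per-frame centroids of the pedestrian's LiDAR points.
    Returns a 3-D axis-angle vector rotating the template's assumed facing
    direction (+x) onto the estimated heading in the ground plane.
    """
    nxt = min(frame_idx + 1, len(centroids) - 1)
    prv = max(frame_idx - 1, 0)
    heading = centroids[nxt] - centroids[prv]          # forward motion between neighbors
    heading -= up * heading.dot(up)                    # project onto the ground plane
    heading /= (np.linalg.norm(heading) + 1e-9)

    template_dir = np.array([1.0, 0.0, 0.0])           # assumed facing of the template pose
    axis = np.cross(template_dir, heading)
    s = np.linalg.norm(axis)
    c = np.clip(template_dir.dot(heading), -1.0, 1.0)
    angle = np.arctan2(s, c)
    axis = up if s < 1e-9 else axis / s                # parallel directions: rotate about up
    return axis * angle                                # axis-angle initialization
```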

Fig. 5a shows some example trajectories from LiDAR points, and Fig. 5b shows the projected 3D models overlaid onto the left images at four sequential time frames after the initialization. When a sequence is only a single frame, we find the initial body orientation with the template pose by minimizing the stereo reprojection error $E_J$ using only the torso body joints and the translation error $E_T$ around the mean of the LiDAR points.

V. EXPERIMENTS

The data in our dataset is captured at complex outdoor urban intersections where pedestrian-to-camera distances are large (5–45 m), with multiple subjects who are often heavily occluded. In contrast, publicly available 3D datasets rely on mocap systems [7], [8], [22] and guarantee less than a few mm accuracy for a single subject appearing within a controlled indoor capture volume of a few meters in radius. Given these distinctions, the accuracy of our dataset needs to be evaluated under these realistic conditions. While comparing the accuracy of our proposed approach with a mocap system would reliably validate our dataset, mocap systems cannot be practically set up at urban intersections.

We address this challenge by leveraging the fact that our proposed approach only requires 2D labels and LiDAR data without using image features. The key factors affecting our approach include the density of the LiDAR returns due to the large distance from the capture vehicle, occlusions by other objects that affect the LiDAR segmentation, the precision of the manual annotation when the pedestrian occupies a small portion of the image, and the calibration between LiDAR and camera. The lighting conditions, background appearance, clothing, and weather conditions do not affect our approach as we operate on manually labeled joint locations. Therefore, we collect and annotate an evaluation dataset in a controlled outdoor environment with a mocap system and get manual 2D labels while replicating the vehicle-to-target distances and the clutter and occlusion of the intersection data. By showing that the 3D labels generated by our method are comparable to the mocap ground truth, we verify the fitting approach and the annotation process against a traditional mocap source.

A. Data Verification

1) Evaluation Dataset: We use the PhaseSpace mocap system with active LED markers, which can be used in outdoor environments. The subject wears a suit with markers placed around body parts and repeats actions such as walking, jogging, and waving that are common for pedestrians. The capture vehicle was parked about 20 m away from the mocap setup. To replicate typical occlusions, we parked another car between the capture vehicle and the mocap setup as well as having groups of pedestrians walking. We selected 626 frames and obtained manual 2D labels for the images. Since the visual appearance does not affect the evaluation, we restrict the evaluation to a single subject with a single background and focus more on variation in poses and occlusions.

2) Evaluation Metric: The 3D mean per joint position error (MPJPE) is a standard metric to evaluate pose estimation algorithms; it is the mean over all joints of the Euclidean distance between ground truth and prediction. In cases where the prediction is not in metric space, the error is computed in a root-relative coordinate frame after allowing a similarity transform to register the prediction to the ground truth. In cases where the prediction is in metric space, we compute MPJPE in the global coordinate frame without any registration. We further report the per joint position errors for both frames. Note that, given the geometry of capture, markers can get completely occluded from the mocap system. Consequently, not all joints are visible in all frames. Moreover, some methods may not predict invisible joints. Therefore, we take a weighted mean while computing the MPJPE, where the weight is equal to the number of frames in which the joint was visible in the ground truth and was predicted.
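The metric can be written compactly; the sketch below computes the weighted MPJPE and includes an optional similarity (Procrustes) registration for methods whose predictions are root-relative and only defined up to scale.

```python
import numpy as np

def similarity_align(p, g):
    """Register prediction p to ground truth g (both (J, 3)) with a similarity transform."""
    mu_p, mu_g = p.mean(0), g.mean(0)
    pc, gc = p - mu_p, g - mu_g
    U, S, Vt = np.linalg.svd(pc.T @ gc)
    R = (U @ Vt).T
    if np.linalg.det(R) < 0:                   # avoid an improper rotation (reflection)
        Vt[-1] *= -1
        S[-1] *= -1
        R = (U @ Vt).T
    scale = S.sum() / (pc ** 2).sum()
    return scale * pc @ R.T + mu_g

def weighted_mpjpe(pred, gt, visible):
    """MPJPE with per-joint weights equal to the number of usable frames.

    pred, gt: (T, J, 3) joints in a common frame; visible: (T, J) bool, true
    when the joint is both predicted and seen by the mocap ground truth.
    Root-relative predictions would be passed through similarity_align per frame first.
    """
    errs, weights = [], []
    for j in range(pred.shape[1]):
        m = visible[:, j]
        if m.any():
            errs.append(np.linalg.norm(pred[m, j] - gt[m, j], axis=1).mean())
            weights.append(m.sum())
    return np.average(errs, weights=weights)
```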


TABLE II
3D MPJPE IN ROOT-RELATIVE COORDINATE FRAMES

TABLE III
3D MPJPE IN GLOBAL COORDINATE FRAMES


3) Baseline Methods: We consider three different families of baseline methods. First, we consider a method that predicts 3D joint coordinates (up to scale) directly from 2D images [3]. Second, we consider methods that take manual 2D joint annotations as inputs [6], [14]. We evaluate these methods in the root-relative frame alone. Third, we consider three naive baselines that use stereo geometry information. As we have 2D joint locations for a calibrated, rectified stereo pair of images, we directly triangulate these 2D joint locations for visible joints. We refer to this method as Triangulation. For the second naive baseline, we use disparity values and 2D joint locations in the left image for the visible joints and the previous triangulation result for invisible joints. We refer to this as Left+disp. Finally, we consider a baseline that modifies an existing technique [14]; this approach uses the calibrated camera parameters and the estimated skeletons, which are scaled to metric space by using the average disparity values at the visible joint locations. We refer to this as SMPLify [14]+disp.
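For a calibrated, rectified stereo pair the Triangulation baseline reduces to textbook disparity-to-depth geometry, as in the sketch below (focal lengths, principal point, and baseline come from the stereo calibration; the function and variable names are ours).

```python
import numpy as np

def triangulate_joints(joints_left, joints_right, fx, fy, cx, cy, baseline):
    """Triangulate 2D joints from a calibrated, rectified stereo pair.

    joints_left/right: (J, 2) pixel coordinates of corresponding visible joints.
    Returns (J, 3) points in the left camera frame (meters if baseline is in meters).
    """
    disparity = joints_left[:, 0] - joints_right[:, 0]
    disparity = np.maximum(disparity, 1e-6)          # guard against zero or negative disparity
    Z = fx * baseline / disparity                    # depth from disparity
    X = (joints_left[:, 0] - cx) * Z / fx
    Y = (joints_left[:, 1] - cy) * Z / fy
    return np.stack([X, Y, Z], axis=1)
```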

4) Accuracy: Table II and Table III summarize the results for root-relative and global frames, respectively. Although fair comparisons can only be done between methods separated by horizontal lines, the objective of these tables is to highlight that the current state-of-the-art still has room for improvement and the utility of the proposed dataset in closing this gap. The proposed approach achieves lower MPJPE for the majority of joints in the root-relative frame. This is expected as our approach leverages additional information including LiDAR returns, temporal priors, as well as stereo annotations. The gains while using this additional data are most prominent in the global coordinate frame as no registration is involved and consequently global translation, orientation, and scale errors can be seen in the global MPJPE. While naive baselines such as Triangulation and Left+disp perform poorly as they do not leverage any prior about the proportions of a typical human skeleton, SMPLify+disp, which leverages priors about human skeletons, still suffers from large errors. In contrast, our proposed approach achieves an MPJPE of 194 mm at an average camera-to-pedestrian distance of 20 × 10³ mm (20 m).

TABLE IV
ABLATION STUDY ON ENERGY TERMS. EACH COLUMN SHOWS PER-JOINT ERRORS IN MM IN ROOT-RELATIVE FRAMES. THE LAST COLUMN SHOWS THE MPJPE IN GLOBAL FRAME

Fig. 6. Results from using different cost terms. The resulting 3D models in the temporal walking sequence were projected and overlaid onto the images.

B. Ablation Study

Table IV summarizes the results for using different subsets of energy terms in the optimization. Note that $E_{J,l}$ and $E_{J,r}$ represent the reprojection errors on the left and right images. $E_T$, $E_{3d}$, and $E_{tp}$ are defined in Sec. IV. Each column shows per-joint errors in the root-relative frame except the last column, which shows the MPJPE in the global frame.

1) Effect of Stereo: To see how using the stereo reprojection error affects the resulting 3D models, we compute reprojection errors for only the left images (i.e., row 4 of Table IV) and for both stereo images (i.e., row 5 of Table IV). Stereo imagery reduced both the global translation error and the root-relative pose error, as it reduces the depth ambiguities that exist for the monocular approach. The second row of Fig. 8 shows the results from the monocular approach, which estimates 3D models from the left images. Notice that in several frames, the legs are swapped or the body orientation is estimated incorrectly. Fig. 6 presents additional results with occlusions, and illustrates similar limitations of the existing approach. We can see that the projection of the estimated 3D models onto the right image does not align exactly. Estimating 3D pose from 2D joints is an inherently ill-posed problem because there exist many feasible body configurations. Using stereo information provides more constraints during the optimization and, overall, reduces such ambiguities. Furthermore, when some joints are occluded in a single image, the second image of the stereo pair may observe these joints, which can reduce the uncertainty and produce better 3D models.


2) Effect of Using LiDAR Points: Although stereo images provide reasonable depth estimation in a global coordinate frame, the translation error may be too large since an error of a few pixels during labeling may create a large resulting error in fit. To place 3D models at the correct location in metric space, we include LiDAR information in the form of the translation prior $E_T$. The translation prior term localizes the 3D models at the distance observed by the LiDAR. As shown in the first two rows of Table IV, adding this translation prior provides an improvement in estimating global 3D pose.

The translation prior only constrains the location of the root joint (hip) in the 3D metric space. Therefore, depth ambiguities in other parts of the body may still exist, especially when the subject appears sideways with respect to the camera. Adding the 3D distance term $E_{3d}$ helps adjust the pose or body orientation to fit the observed LiDAR points. Consequently, the mean error in the root-relative frame reduces significantly from row 2 to row 4.

3) Temporal Prior: The temporal prior term, $E_{tp}$, penalizes unlikely transitions of poses and translations between consecutive frames. It also makes the resulting 3D pose between consecutive frames appear smooth. In Table IV, rows 2 and 3, and rows 4 and 5, show the errors without and with the temporal prior, respectively. By adding this term we obtain similar values for the root-relative errors, and achieve lower global error. Another advantage of the temporal prior is that it makes the model robust to 2D labeling noise for the occluded joints. Fig. 6 illustrates examples with severe occlusions. It is likely that the 2D body joint labels are inconsistent across the frames under severe occlusion, and that affects the estimated pose. As seen in rows 4 and 6 of Fig. 6, the resulting 3D poses are smoother when this temporal prior term is included. However, if the weight of the temporal prior is set too high, the transition of poses between frames can become too restricted.

4) Global Shape Consistency: Fig. 8 compares the results from SMPLify [14] with those from our method, which enforces consistency of the global shape parameters across multiple frames. As expected, using the global shape consistency constraint produces more consistent shapes across the frames, while the resulting 3D models from SMPLify, which only uses per-frame information, look inconsistent. Moreover, the SMPLify models are too skinny, especially when the desired pose is far from the template pose.

C. Effect of Noisy 2D Labels

The size of manually labeled datasets is always limited by the time-consuming and costly annotation process. Luckily, recent methods have made great progress in 2D visual recognition tasks. Using the pre-trained networks [1], [36], we computed instance-level segmentation masks and 2D joint detections on our evaluation dataset in place of manual labels. Since only a single subject exists in a scene wearing the mocap suit, the tracking ID is trivially obtained.

Table V shows that the 3D MPJPE for using the 2D estimates is larger than that for using manual labels. For the hip joint, manual labels by annotators can be less consistent across frames than the 2D algorithm outputs, which might have resulted in the larger error. Although using manual labels produced higher accuracy than using 2D estimates, the method with automatic labels still achieved greater accuracy than the ad-hoc versions of monocular methods shown in Table III. This illustrates that our proposed approach is robust to noisy 2D labels and could potentially be scaled to larger datasets using state-of-the-art 2D visual recognition algorithms.

TABLE V
COMPARISON OF 3D MPJPE IN ROOT-RELATIVE COORDINATE FRAME FOR AUTOMATIC AND MANUAL 2D LABELS. THE LAST COLUMN IS THE MPJPE IN GLOBAL COORDINATE FRAME (IN MM)

Fig. 7. Representative samples from our dataset. 3D models are rendered onto the images.


D. Qualitative Results

Fig. 7 shows some representative examples from PedX that illustrate the uniqueness and variety of our dataset. Our dataset covers various actions and poses that are frequently encountered at intersections. Examples include walking, jogging, waving, using a phone, cycling, carrying objects, and talking. Occlusions are another challenge in estimating 3D pose. Our dataset contains many pedestrian instances with severe occlusions by surrounding objects or by other pedestrians. In addition, most frames of the dataset contain more than one pedestrian at a time. Our dataset also contains different weather conditions and rare occurrences such as people in wheelchairs or pedestrians jaywalking.

E. Runtime and Sequence Length

Our 3D model fitting approach involves high-dimensional parameter optimization. The runtime of fitting 3D models per sequence goes up as the number of frames per sequence increases. The number of frames per sequence of a pedestrian varies, and it is more than a hundred frames for most sequences. To keep the runtime feasible, we use up to 10 neighboring frames per sequence. It takes less than 5 minutes to process a single sequence.


Fig. 8. Results from monocular SMPLify and from our method.

VI. CONCLUSION

In this letter, we present a novel large scale multimodal dataset of pedestrians at complex urban intersections with a rich set of 2D/3D annotations. The PedX dataset provides a platform for understanding pedestrian behaviors at intersections with real-life challenges. This dataset can be used to solve 3D human pose estimation, pedestrian detection, and tracking in-the-wild, and can be further extended to new problems.

REFERENCES

[1] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime multi-person 2-D pose estimation using part affinity fields," in Proc. Conf. Comput. Vis. Pattern Recognit., 2017, pp. 7291–7299.

[2] D. Mehta et al., "VNect: Real-time 3D human pose estimation with a single RGB camera," ACM Trans. Graph., vol. 36, no. 4, p. 44, 2017.

[3] X. Zhou, Q. Huang, X. Sun, X. Xue, and Y. Wei, "Towards 3D human pose estimation in the wild: A weakly-supervised approach," in Proc. Int. Conf. Comput. Vis., 2017, pp. 398–407.

[4] X. Zhou, M. Zhu, S. Leonardos, K. G. Derpanis, and K. Daniilidis, "Sparseness meets deepness: 3D human pose estimation from monocular video," in Proc. Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4966–4975.

[5] C.-H. Chen and D. Ramanan, "3D human pose estimation = 2D pose estimation + matching," in Proc. Conf. Comput. Vis. Pattern Recognit., 2017, pp. 7035–7043.

[6] J. Martinez, R. Hossain, J. Romero, and J. J. Little, "A simple yet effective baseline for 3D human pose estimation," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2659–2668.

[7] L. Sigal, A. O. Balan, and M. J. Black, "HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion," Int. J. Comput. Vis., vol. 87, no. 1, pp. 4–27, 2010.

[8] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, "Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 7, pp. 1325–1339, Jul. 2014.

[9] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele, "DeeperCut: A deeper, stronger, and faster multi-person pose estimation model," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 34–50.

[10] A.-I. Popa, M. Zanfir, and C. Sminchisescu, "Deep multitask architecture for integrated 2D and 3D human sensing," in Proc. Conf. Comput. Vis. Pattern Recognit., 2017, pp. 6289–6298.

[11] D. Mehta et al., "Monocular 3D human pose estimation in the wild using improved CNN supervision," in Proc. IEEE Int. Conf. 3D Vis., 2017, pp. 506–516.

[12] D. Tome, C. Russell, and L. Agapito, "Lifting from the deep: Convolutional 3D pose estimation from a single image," in Proc. Conf. Comput. Vis. Pattern Recognit., 2017, pp. 2500–2509.

[13] V. Ramakrishna, T. Kanade, and Y. Sheikh, "Reconstructing 3D human pose from 2D image landmarks," in Proc. Eur. Conf. Comput. Vis., 2012, pp. 573–586.

[14] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black, "Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 561–578.

[15] I. Akhter and M. J. Black, "Pose-conditioned joint angle limits for 3D human pose reconstruction," in Proc. Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1446–1455.

[16] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black, "SMPL: A skinned multi-person linear model," ACM Trans. Graph., vol. 34, no. 6, p. 248, 2015.

[17] C. Lassner, J. Romero, M. Kiefel, F. Bogo, M. J. Black, and P. V. Gehler, "Unite the people: Closing the loop between 3D and 2D human representations," in Proc. Conf. Comput. Vis. Pattern Recognit., 2017, pp. 6050–6059.

[18] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik, "End-to-end recovery of human shape and pose," in Proc. Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7122–7131.

[19] R. Alp Guler, N. Neverova, and I. Kokkinos, "DensePose: Dense human pose estimation in the wild," in Proc. Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7297–7306.

[20] K. Rematas, I. Kemelmacher-Shlizerman, B. Curless, and S. Seitz, "Soccer on your tabletop," in Proc. Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4738–4747.

[21] D. Mehta et al., "Single-shot multi-person 3D pose estimation from monocular RGB," in Proc. IEEE Int. Conf. 3D Vis., 2018, pp. 120–130.

[22] F. Ofli, R. Chaudhry, G. Kurillo, R. Vidal, and R. Bajcsy, "Berkeley MHAD: A comprehensive multimodal human action database," in Proc. IEEE Workshop Appl. Comput. Vis., 2013, pp. 53–60.

[23] "CMU mocap database," [Online]. Available: http://mocap.cs.cmu.edu/, Accessed on: 2018.

[24] D. Mehta et al., "Monocular 3D human pose estimation in the wild using improved CNN supervision," in Proc. IEEE Int. Conf. 3D Vis., 2017, pp. 506–516.

[25] "The Captury," 2017. [Online]. Available: http://thecaptury.com/

[26] T. von Marcard, B. Rosenhahn, M. J. Black, and G. Pons-Moll, "Sparse inertial poser: Automatic 3D human pose estimation from sparse IMUs," Comput. Graph. Forum, vol. 36, no. 2, pp. 349–360, 2017.

[27] T. von Marcard, R. Henschel, M. J. Black, B. Rosenhahn, and G. Pons-Moll, "Recovering accurate 3D human pose in the wild using IMUs and a moving camera," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 601–617.

[28] M. F. Ghezelghieh, R. Kasturi, and S. Sarkar, "Learning camera viewpoint using CNN to improve 3D body pose estimation," in Proc. IEEE Int. Conf. 3D Vis., 2016, pp. 685–693.

[29] W. Chen et al., "Synthesizing training images for boosting human 3D pose estimation," in Proc. IEEE Int. Conf. 3D Vis., 2016, pp. 479–488.

[30] G. Varol et al., "Learning from synthetic humans," in Proc. Conf. Comput. Vis. Pattern Recognit., 2017, pp. 109–117.

[31] G. Rogez and C. Schmid, "MoCap-guided data augmentation for 3D pose estimation in the wild," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 3108–3116.

[32] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis, "Harvesting multiple views for marker-less 3D human pose annotations," in Proc. Conf. Comput. Vis. Pattern Recognit., 2017, pp. 6988–6997.

[33] Y. Huang et al., "Towards accurate markerless human shape and pose estimation over time," in Proc. IEEE Int. Conf. 3D Vis., 2017, pp. 421–430.

[34] "PedX dataset," 2019. [Online]. Available: http://pedx.io/

[35] S. Boyd et al., "Distributed optimization and statistical learning via the alternating direction method of multipliers," Found. Trends Mach. Learn., vol. 3, no. 1, pp. 1–122, 2011.

[36] K. He, G. Gkioxari, P. Dollar, and R. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2961–2969.