
Robust Driver Head Pose Estimation in Naturalistic Conditions from Point-Cloud Data

Tiancheng Hu, Sumit Jha and Carlos Busso

Abstract— Head pose estimation has been a key task in computer vision since a broad range of applications often requires accurate information about the orientation of the head. Achieving this goal with regular RGB cameras faces challenges in automotive applications due to occlusions, extreme head poses and sudden changes in illumination. Most of these challenges can be attenuated with algorithms relying on depth cameras. This paper proposes a novel point-cloud based deep learning approach to estimate the driver's head pose from depth camera data, addressing these challenges. The proposed algorithm is inspired by the PointNet++ framework, where points are sampled and grouped before extracting discriminative features. We demonstrate the effectiveness of our algorithm by evaluating our approach on a naturalistic driving database consisting of 22 drivers, where the benchmark for the orientation of the driver's head is obtained with the Fi-Cap device. The experimental evaluation demonstrates that our proposed approach relying on point-cloud data achieves predictions that are almost always more reliable than state-of-the-art head pose estimation methods based on regular cameras. Furthermore, our approach provides predictions even for extreme rotations, which is not the case for the baseline methods. To the best of our knowledge, this is the first study to propose head pose estimation using deep learning on point-cloud data.

I. INTRODUCTION

Head pose estimation (HPE) is an important task in computer vision. It has a wide range of applications in areas such as advanced driver assistance systems [1], visual attention modeling [2] and gaze estimation systems [3]–[6]. In automotive applications, which are the focus of this study, head pose estimation is particularly valuable. Head pose estimation can be used to monitor the driver's awareness [7], [8], assist interaction with in-car entertainment systems [9], and understand the driver's intention and predict future driver actions [10]. Therefore, having a reliable head pose estimation system is crucial for advances in in-vehicle safety systems.

The topic of HPE has been extensively studied using regular RGB cameras. The solutions include nonlinear manifold learning [11], random forest (RF) [12], support vector machine (SVM) [13] and, most recently, deep learning [9], [10], [14], [15]. However, progress on developing robust head pose estimation for automotive applications is limited [1], [10], [15], [16]. Existing research has shown promising results, but there is still room for improvement, as existing HPE algorithms commonly suffer from high errors when there are occlusions of the driver, extreme head rotations, and sudden changes in illumination in the vehicle [17] (Fig. 1).

*This work was supported by Semiconductor Research Corporation (SRC) / Texas Analog Center of Excellence (TxACE), under task 2810.014.

Tiancheng Hu, Sumit Jha and Carlos Busso are with the Department of Electrical Engineering, University of Texas at Dallas, Richardson, Texas, USA. {[email protected], [email protected], [email protected]}

(a) Occlusion (b) Extreme head pose (c) Illumination

Fig. 1: Examples of three types of challenges for HPE in a car. The frames are from recordings from the MVAD corpus.

A time-of-flight camera based solution can, to a certain degree, alleviate some of the issues affecting head pose estimation in the vehicle. Because time-of-flight cameras use active infrared lighting to acquire images, they are immune to sudden illumination changes, which commonly occur in a moving vehicle. The most common way to represent a depth camera recording is through depth maps, which project depth information into a single view [9], [10], [12], [15]. This approach ignores important 3D spatial information. Instead, we represent the depth camera recording using point-clouds. A point-cloud contains the 3D position of each point relative to the center of the camera. Figure 2(a) shows an example of a point-cloud frame, describing the face of the driver.

Inspired by the success of point-cloud approaches such as PointNet++ [18], we present a deep learning based end-to-end regression algorithm that estimates the driver's head pose directly from point-cloud data. We modified the set abstraction layer from PointNet++ as the backbone of our model. Each set abstraction layer has three components: sampling, grouping and PointNet. For sampling, we perform iterative farthest point sampling (IFPS) to choose anchor points from the input. For grouping, we use ball query to group the neighboring points of each anchor point together. Then, we perform PointNet, which consists of a parameter-shared multilayer perceptron (MLP) to capture local features, and a maxpooling operation after the MLP. We repeat the set abstraction layers five times, where each of them is implemented with three different scales. Finally, we generate a 6D vector as our prediction from linear combinations of the output of the last set abstraction layer, from which we infer the orientation of the head. To the best of our knowledge, this is the first work on HPE using a deep learning-based algorithm on point-cloud data.

We use a multimodal database to demonstrate the effectiveness of our algorithm. The model is trained and evaluated with data from 22 subjects, who participate in naturalistic driving recordings with ground truth for head pose information provided by the Fi-Cap device [19]. We compare this algorithm with two automatic HPE algorithms using regular RGB cameras: OpenFace 2.0 [20] and FAN [21]. The results show that our point-cloud algorithm leads to lower mean square errors (MSE) than the baselines for most conditions. Likewise, we obtain predictions for a wider range of head poses compared to the baselines, which is another important benefit of our approach. These results demonstrate the benefits of our proposed solution.

(a) Point-cloud data (b) Grayscale image

Fig. 2: Examples of images from the depth camera. (a) point-cloud data from one frame, (b) grayscale image from the depth camera used as a reference for calibration (Sec. IV-B).

II. RELATED WORK

A. General Head Pose Estimation

The computer vision community has extensively studied the problem of head pose estimation. Murphy-Chutorian and Trivedi [22] and Czuprynski and Strupczewski [23] surveyed studies on head pose estimation. Methods using appearance-based models have recently received increased attention due to the popularity of convolutional neural networks (CNNs). Liu et al. [24] proposed a regression model implemented with a CNN trained with synthesized RGB images. They demonstrated significant improvements in HPE compared to previous works. Most existing works have focused on HPE from RGB images. Recently, commercial depth cameras have become increasingly accessible, which has led to a growing interest in HPE from depth images and point-clouds. Meyer et al. [25] localized the 3D head in a depth image, fitting the head portion of the depth image onto a 3D morphable face model by combining the particle swarm optimization algorithm and the iterative closest point algorithm.

B. Driver Head Pose Studies

HPE is a difficult task inside vehicles using regular cameras. Jha and Busso [17] evaluated state-of-the-art HPE algorithms, concluding that illumination, extreme head rotation and occlusions significantly impact the results. In fact, the selected algorithms were not able to provide an output in 8.9% to 21.8% of the frames.

Murphy-Chutorian et al. [1] used two support vector regressors (SVRs) to estimate head rotations (pitch and yaw). They detected the face with a cascaded-Adaboost face detector. They extracted localized gradient orientation (LGO) histograms for each detected head, which were used as features to train the SVRs. LGO histograms were more robust to light conditions, as tested on a real vehicle. Bar et al. [26] designed an approach for head pose and gaze estimation intended for in-vehicle applications. The method uses multiple 3D templates that are aligned with point-cloud data using the iterative closest point (ICP) algorithm. To address the problem of temporal partial occlusions of the driver's face, they adopted multiple different face templates, taking the mean value of the different estimations. The approach was evaluated with laboratory recordings. Borghi et al. [9] utilized a multi-modal approach based on several CNNs. They trained one CNN for head localization. They used three CNNs independently trained with either RGB images, depth data or optical flow information. The three CNNs were then fused to consolidate the HPE results. The method was evaluated with data from the POSEidon corpus, which was collected in a car simulator. Schwarz et al. [10] proposed a CNN-based model that fuses information from infrared and depth images for HPE, evaluating their models on the DriveAHead corpus.

C. Processing Depth Data Using Deep Learning

The most common approach to represent data from depth cameras is to create 2D depth maps, where a value in the array provides the depth information at that specific x and y location. This representation has been widely used in HPE [12], [25]. For example, studies have used CNN-based approaches to process depth maps for HPE [10], [15].

An alternative way to represent depth data is with a point-cloud, which is a two-dimensional array containing the x, y and z locations of each detected point in the coordinate system of the camera. For point-cloud classification tasks, Wu et al. [27] proposed a 3D CNN approach that represents a 3D point-cloud as a binary probabilistic distribution on a 3D voxel grid. They used a 3D CNN to process the grid. This approach is not very effective because converting a 3D point-cloud into a 3D voxel grid inevitably leads to information loss. This method also requires a large model size and, therefore, it is harder to train, especially given that point-cloud datasets are usually much smaller than RGB image datasets. Another approach, proposed by Su et al. [28], was to generate 2D image renderings of a 3D point-cloud, and use these renderings to train a CNN-based algorithm with view pooling for 3D object classification. This approach gives more compelling results than the 3D voxel grid CNN approach. More recent studies include PointNet [29] and PointNet++ [18]. These two approaches directly process 3D point-cloud data without converting it to any other intermediate representation, showing very competitive results on classification and segmentation tasks.

III. PROPOSED APPROACH

Inspired by the success of PointNet [29] and PointNet++ [18], we propose a model that utilizes the set abstraction layers used in PointNet++ as a feature extractor for HPE. The feature representation is the input of a fully connected layer with linear activation to estimate the rotation of the head.


Fig. 3: Proposed point-cloud approach for HPE. The model creates a 6D vector from which we can estimate the head rotation of the driver.

Zhou et al. [30] show the benefits of representing orientation with a 6D vector extracted from the rotation matrix. We follow this approach by training our model to estimate this 6D vector. Notice that it is easy to convert the 6D vector into a full rotation matrix by using the Gram-Schmidt process.

Consider the point-cloud from frame t, consisting of n_t points. We create an n_t × 3 matrix X_t. The proposed deep learning model creates a mapping f such that

    f(X_t) = y,   y ∈ R^6                                   (1)

where y is the 6D vector representing the head orientation. This section explains our proposed approach.
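As a concrete illustration of this output representation, the NumPy sketch below (not the authors' implementation) maps a predicted 6D vector y to a valid rotation matrix with the Gram-Schmidt process, following the construction in Zhou et al. [30]; stacking the resulting vectors as columns is an assumed convention.

```python
import numpy as np

def rotation_from_6d(y):
    """Map a 6D orientation vector to a 3x3 rotation matrix via
    Gram-Schmidt orthogonalization (in the spirit of Zhou et al. [30])."""
    a1, a2 = y[:3], y[3:]
    b1 = a1 / np.linalg.norm(a1)            # first orthonormal vector
    a2 = a2 - np.dot(b1, a2) * b1           # remove the component along b1
    b2 = a2 / np.linalg.norm(a2)            # second orthonormal vector
    b3 = np.cross(b1, b2)                   # completes a right-handed frame
    return np.stack([b1, b2, b3], axis=1)   # columns of the rotation matrix

# a noisy network output still yields a proper rotation
R = rotation_from_6d(np.array([0.9, 0.1, 0.0, -0.1, 1.1, 0.05]))
assert np.allclose(R.T @ R, np.eye(3), atol=1e-6)
```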

A. Model Structure

Figure 3 shows a diagram of our proposed point-cloud model. Table I shows the implementation details of our model using five set abstraction layers, where the number of sampled points in each set is sequentially reduced.

The main blocks of the set abstraction layer are sampling, grouping, and PointNet. The first step is sampling points from the point-cloud data, where we reduce the level of redundancy in the raw data, and increase the computational efficiency of the algorithm. We use the iterative farthest point sampling (IFPS) algorithm, choosing a predefined number of anchor points. For example, the first set abstraction layer samples 512 points, denoted by S1 in Table I.
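A minimal NumPy sketch of iterative farthest point sampling is shown below; it is illustrative only, and the starting point and tie-breaking details of the actual implementation may differ.

```python
import numpy as np

def farthest_point_sampling(points, m):
    """Pick m anchor indices that are mutually far apart (IFPS sketch).
    points: (n, 3) array of 3D coordinates."""
    n = points.shape[0]
    selected = np.zeros(m, dtype=int)
    selected[0] = np.random.randint(n)      # arbitrary starting anchor
    dist = np.full(n, np.inf)               # distance to the closest anchor so far
    for i in range(1, m):
        d = np.linalg.norm(points - points[selected[i - 1]], axis=1)
        dist = np.minimum(dist, d)
        selected[i] = np.argmax(dist)       # farthest point from all chosen anchors
    return selected

anchors = farthest_point_sampling(np.random.rand(5000, 3), 512)  # e.g., S1
```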

The second step is grouping the anchor points. Similar to the way a CNN works, the goal is to capture the relationship between the anchor points and their neighboring points. The algorithm uses ball query to group nearby points around an anchor point. The grouping is implemented at three different resolutions. For example, the first set abstraction layer simultaneously groups l = 4 points within r = 0.1 units of each anchor point (G1), l = 8 points within r = 0.2 units of each anchor point (G2), and l = 16 points within r = 0.4 units of each anchor point (G3).
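The grouping step can be sketched with the ball query below (illustrative NumPy code, not the paper's implementation); padding a group by repeating the first hit when fewer than l neighbors fall inside the radius follows common PointNet++ practice and is an assumption here.

```python
import numpy as np

def ball_query(points, anchor_idx, radius, l):
    """For each anchor, gather up to l neighbor indices within the radius."""
    groups = []
    for a in anchor_idx:
        d = np.linalg.norm(points - points[a], axis=1)
        idx = np.where(d < radius)[0]
        if idx.size == 0:
            idx = np.array([a])                       # fall back to the anchor itself
        if idx.size < l:                              # pad by repeating the first hit
            idx = np.concatenate([idx, np.repeat(idx[0], l - idx.size)])
        groups.append(idx[:l])
    return np.stack(groups)                           # (num_anchors, l)

pts = np.random.rand(5000, 3)
anchor_idx = np.random.choice(len(pts), 512, replace=False)
g1 = ball_query(pts, anchor_idx, radius=0.1, l=4)     # first scale (G1)
```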

The third step is PointNet [29], which aims to find a discriminative feature representation. This goal is achieved by transforming the points into a higher dimension, where we can achieve better discrimination. We can view these transformations as extra layers that capture the features of each anchor point as well as its neighboring areas. PointNet is implemented with three-layer multilayer perceptrons (MLPs) with shared weights. Every point is passed through the same set of filters. The MLP projects the point into a different feature representation. There are pooling operations after each MLP transformation. For example, PointNet on the first set abstraction layer operates on the groupings G1, G2 and G3. For the grouping G1, the three layers of the MLP have 8, 8 and 16 neurons, respectively. We use the maxpooling operation to create the 512 × 3 output (T1). For the grouping G2, the three layers of the MLP have 16, 16 and 32 neurons, respectively. After the maxpooling operation, we create a 512 × 3 output (T2). For the grouping G3, the three layers of the MLP have 16, 24 and 32 neurons, respectively. We also obtain a 512 × 3 output (T3) after the maxpooling operation. PointNet provides a 512 × 9 output (T4) by concatenating T1, T2, and T3.
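The parameter-shared MLP with max pooling can be sketched in PyTorch as follows; this is an illustrative block whose layer widths follow the first scale of Table I, but the tensor layout and output bookkeeping are assumptions and may not match the paper's exact dimensions.

```python
import torch
import torch.nn as nn

class SharedMLP(nn.Module):
    """Parameter-shared MLP applied to every point, followed by max pooling
    over each group (sketch of the PointNet block in a set abstraction layer)."""
    def __init__(self, in_dim=3, widths=(8, 8, 16)):
        super().__init__()
        layers, prev = [], in_dim
        for w in widths:
            # 1x1 convolution = the same MLP applied to every point in every group
            layers += [nn.Conv2d(prev, w, kernel_size=1), nn.ReLU()]
            prev = w
        self.mlp = nn.Sequential(*layers)

    def forward(self, x):
        # x: (batch, in_dim, num_anchors, group_size)
        feat = self.mlp(x)                  # per-point features
        return feat.max(dim=-1).values      # max pool over each group

block = SharedMLP()
out = block(torch.randn(2, 3, 512, 4))      # -> (2, 16, 512)
```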

Our proposed point-cloud solution uses five set abstraction layers. The second set abstraction layer is exactly the same as the first set, except that the input is no longer the raw point-cloud data. Instead, we perform sampling from S1, and grouping from T4 (i.e., sampling from the anchor points selected in the previous set, and grouping from the transformed anchor points from the previous set). From the second set onward, the PointNet layer takes as input the concatenation of the grouping step output and the anchor points. The third and fourth set abstraction layers are similar to the second set. The fifth set abstraction layer only has one anchor point. In the grouping layer, we group all points together by using an infinite radius. Then, we perform the usual PointNet layer, but with weighted average pooling at the end (T5).

PointNet++ was designed for classification and segmentation problems. Instead, we are using the set abstraction layer to obtain a discriminative feature representation for a regression problem. The final step is to connect T5 to a fully connected layer with linear output units, creating a 6D vector.

IV. MULTIMODAL DRIVER MONITORING DATASET

This study uses recordings from the Multimodal Driver Monitoring (MDM) Dataset, which we are currently collecting at The University of Texas at Dallas (UT Dallas). The corpus consists of naturalistic driving recordings on the UTDrive vehicle [31]–[33], which is a 2006 Toyota RAV4 equipped with multiple sensors. This data collection is an ongoing effort, where the set used for this study contains recordings from 22 subjects (18 males, 4 females), with a total duration of 17 hours and 39 minutes. The subjects are mostly college students attending UT Dallas.

In this data collection, we use four GoPro HERO6 Black cameras installed to collect (1) frontal views of the driver's face, (2) semi-profile views of the driver's face as viewed from the rear mirror, (3) the back side of the driver's head to record the Fi-Cap device (see Sec. IV-B), and (4) the road. We set the frame rate to 60 fps with a resolution of 1920 × 1080 pixels. Likewise, we use a CamBoard pico flexx for depth data recording. The camera is capable of capturing both depth data and grayscale images. The pico flexx supports a resolution of 224 × 171, and we set the frame rate to the maximum, which is 45 fps. The measuring range is between 0.1 m and 4 m, which suits the requirements for in-vehicle recordings of a driver.


TABLE I: Model specifications for the proposed approach. m is the number of samples chosen as anchor points. dc = 3 is the dimension of the input coordinate system (i.e., x, y and z). r represents the radius of the ball query for the grouping operation. l represents the number of points within r to be chosen in the grouping step. df is the feature dimension of the previous set. In the first set, df = dc = 3.

Layer     | Specification   | Output Dimension

Sampling  | m=512           | [m=512, dc=3] (S1)
Grouping  | [r=0.1, l=4]    | [m=512, df=3, l=4] (G1)
          | [r=0.2, l=8]    | [m=512, df=3, l=8] (G2)
          | [r=0.4, l=16]   | [m=512, df=3, l=16] (G3)
PointNet  | [8, 8, 16]      | [512, 3] (T1)
          | [16, 16, 32]    | [512, 3] (T2)
          | [16, 24, 32]    | [512, 3] (T3)
          |                 | [512, 9] (T4)

Sampling  | m=256           | [m=256, dc=3]
Grouping  | [r=0.15, l=8]   | [m=256, df=12, l=8]
          | [r=0.3, l=16]   | [m=256, df=12, l=16]
          | [r=0.5, l=24]   | [m=256, df=12, l=24]
PointNet  | [16, 16, 64]    | [256, 64]
          | [32, 32, 64]    | [256, 64]
          | [48, 48, 96]    | [256, 96]
          |                 | [256, 224]

Sampling  | m=128           | [m=128, dc=3]
Grouping  | [r=0.15, l=8]   | [m=128, df=227, l=8]
          | [r=0.3, l=16]   | [m=128, df=227, l=16]
          | [r=0.5, l=24]   | [m=128, df=227, l=24]
PointNet  | [16, 16, 64]    | [128, 64]
          | [32, 32, 64]    | [128, 64]
          | [48, 48, 96]    | [128, 96]
          |                 | [128, 224]

Sampling  | m=64            | [m=64, dc=3]
Grouping  | [r=0.2, l=16]   | [m=64, df=227, l=16]
          | [r=0.4, l=32]   | [m=64, df=227, l=32]
          | [r=0.8, l=48]   | [m=64, df=227, l=48]
PointNet  | [32, 32, 64]    | [64, 64]
          | [48, 48, 64]    | [64, 64]
          | [64, 64, 128]   | [64, 128]
          |                 | [64, 256]

Sampling  | m=1             | [m=1, dc=3]
Grouping  | [r=inf]         | [64, 259]
PointNet  | [256, 256, 384] | [384] (T5)
Fully Connected Layer | 6   | [6]

We use a clapping board to synchronize the two types of cameras at the beginning of each recording. We align intermediate frames based on the time elapsed from the clapping frame to the current frame.
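A minimal sketch of this time-based alignment, assuming the 60 fps GoPro and 45 fps pico flexx rates given above; the function and variable names are illustrative.

```python
def matching_depth_frame(gopro_frame, clap_gopro, clap_depth,
                         gopro_fps=60.0, depth_fps=45.0):
    """Map a GoPro frame index to the closest pico flexx frame using the
    time elapsed since each camera's clapping frame (illustrative sketch)."""
    elapsed = (gopro_frame - clap_gopro) / gopro_fps   # seconds since the clap
    return clap_depth + round(elapsed * depth_fps)

# e.g., 10 s after the clap: GoPro frame 600 maps to depth frame 450
assert matching_depth_frame(600, 0, 0) == 450
```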

A. Data Collection Protocol

We ask the driver to perform the following protocol:
• Phase I: Drivers are asked to look at markers inside and outside the car at different locations while the car is parked.
• Phase II: Drivers are asked to operate the vehicle following a predefined route. While driving, they are asked to identify landmarks on the route and cars satisfying certain conditions (e.g., red cars and specific car makes). We also ask them to look at markers inside the vehicle.
• Phase III: Drivers are asked to conduct secondary tasks following the instructions given by the researcher leading the data collection. The tasks include changing the channel on a radio, and following GPS directions.

The subjects are told to only follow our instructions when they believe it is safe to conduct these tasks. They have the right to withdraw from the experiment at any time during the recording.

B. Ground Truth Labels for HPE Using Fi-Cap

A key feature of the corpus is that robust head pose labels can be easily assigned to each frame with the use of the Fi-Cap device [19]. Fi-Cap is a 3D helmet with 23 AprilTags [34]. AprilTags are 2D fiducial markers with predefined patterns that can be robustly detected with regular cameras. It is easy to estimate the orientation and position of each AprilTag with regular cameras. The Fi-Cap is worn on the back side of the driver's head, without creating occlusions for cameras recording the frontal views of the face. The purpose of the camera behind the driver is to record as many AprilTags as possible to derive reliable labels for HPE. We use the Kabsch algorithm and the ICP algorithm to obtain the ground truth for the head pose from all the visible AprilTags (details are explained in Jha and Busso [19]). We have reported a detailed analysis of the performance of Fi-Cap, finding that the median angular error was only 2.30° [19].

Fi-Cap gives the head pose labels in terms of the Fi-Cap coordinate system. We need to calibrate the coordinates across subjects, because the participants are different, and the actual placement of the Fi-Cap varies across recordings. We manually select one grayscale frame from the Pico Flexx camera as a global reference (Fig. 2(b)). All the angles are expressed with respect to this reference. Then, we use OpenFace 2.0 [20] (Sec. V-B) on the grayscale recordings of each subject to find the frame for each subject that has the most similar head pose orientation as the reference frame. Finally, we correct the ground truth rotation of the other frames from that subject by multiplying by the inverse of the rotation matrix associated with the reference frame.
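A minimal sketch of this correction, assuming rotation matrices and left-multiplication by the inverse of the subject's reference rotation; the exact composition order depends on the coordinate conventions detailed in [19].

```python
import numpy as np

def calibrate_to_reference(R_frames, R_subject_ref):
    """Express each per-frame rotation relative to the subject's reference
    frame by multiplying by the inverse (transpose) of the reference rotation."""
    R_ref_inv = R_subject_ref.T             # inverse of a rotation matrix
    return [R_ref_inv @ R for R in R_frames]
```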

V. EXPERIMENTAL EVALUATION

A. Implementation

We use the Adam optimizer with an initial learning rate of 0.001, decaying the learning rate by a factor of 0.7 every 2 million steps. We use a batch size of 16. The model is trained with an L2 loss function comparing the predicted and actual 6D vectors representing the head rotation. We use distance-based and statistical filters to remove points that are clearly part of the background. Then, we apply voxel grid downsampling to sample 5,000 points from each of our point-clouds. Each point-cloud is normalized such that the centroid is at the origin, and all the points lie inside a unit sphere.
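The normalization step can be summarized with the short NumPy sketch below (illustrative only; the background filters and voxel grid downsampling are omitted).

```python
import numpy as np

def normalize_point_cloud(points):
    """Center the cloud at the origin and scale it into the unit sphere."""
    centered = points - points.mean(axis=0)           # centroid to the origin
    scale = np.linalg.norm(centered, axis=1).max()    # radius of the farthest point
    return centered / scale

cloud = normalize_point_cloud(np.random.rand(5000, 3))
assert np.linalg.norm(cloud, axis=1).max() <= 1.0 + 1e-9
```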

We split the corpus using participant-independent partitions, where data from 14 subjects are used for the train set, data from four subjects are used for the development set, and data from four subjects are used for the test set.

B. Baselines

The baselines in this study are two state-of-the-art HPE algorithms using regular cameras. While our approach processes data from the CamBoard pico flexx, the baselines are evaluated using the high-resolution frames from the GoPro HERO6 Black camera facing the driver.

The first approach is OpenFace 2.0 [20], which is a real-time toolkit that supports many facial behavior analysis tasks, including HPE. It utilizes the convolutional experts constrained local model (CE-CLM) [35] to detect facial landmarks. The head rotation is obtained by using the 3D facial landmarks to solve the Perspective-n-Point problem.


Fig. 4: The distribution of the frames in the test set as a function of yaw (horizontal) and pitch (vertical) angles. The figure is in logarithmic scale for better visualization. Bins with lighter colors indicate that more frames are found in the corpus. The bins with white color and no number indicate that no frames are available in the specific rotation range.

The second baseline is the face alignment network (FAN) [21], which is a state-of-the-art deep learning based facial landmark estimation algorithm. We use the RGB images from the GoPro camera that match the reference frames found in the calibration step for each of our subjects. Then, we extract 68 landmarks with FAN for each of these reference frames. We estimate the head pose of each frame by comparing the positions of its 68 facial landmarks with the corresponding landmarks from the reference frame. We use singular value decomposition (SVD) to calculate the rotation.
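As a hedged illustration of this step, the sketch below recovers a rotation from two sets of corresponding landmarks with SVD, in the spirit of the Kabsch solution; the (68, 3) shape assumes 3D landmarks, and the exact alignment details of the FAN baseline pipeline may differ.

```python
import numpy as np

def rotation_from_landmarks(P, Q):
    """Rotation aligning reference landmarks P to current landmarks Q,
    both (68, 3) arrays, estimated with SVD (Kabsch-style sketch)."""
    P0 = P - P.mean(axis=0)                 # center both landmark sets
    Q0 = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P0.T @ Q0)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against a reflection
    D = np.diag([1.0, 1.0, d])
    return Vt.T @ D @ U.T                   # R such that R @ p is close to q
```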

The predictions from OpenFace 2.0 and FAN are with respect to their own coordinate systems. We apply a subject-wise transformation so that the coordinate system matches the coordinate system of the pico flexx camera. For a given driver, we find the transformations between the ground truth labels and the predictions of a baseline for all the frames. Then, we find the average transformation, which is used to compensate for the differences in the coordinate systems. The ground truth has been filtered to consider only frames with roll, yaw and pitch rotations below 45°, 80° and 70°, respectively. These values are considered following the head rotation limits of an average adult [36].
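The paper does not specify how the average transformation is computed; one plausible choice, sketched below as an assumption, is a chordal mean that averages the per-frame relative rotations and projects the result back onto SO(3) with SVD.

```python
import numpy as np

def average_rotation(R_list):
    """Chordal mean of rotation matrices: average the matrices and project
    the result back onto SO(3) with SVD (one possible averaging scheme)."""
    M = np.mean(R_list, axis=0)
    U, _, Vt = np.linalg.svd(M)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    return U @ D @ Vt
```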

C. Results

Figure 4 shows the head pose histogram for the test set as a function of the yaw and pitch angles estimated with Fi-Cap. The number in each bin indicates the number of frames in that specific rotation range, where we use a logarithmic scale (base 10) for better visualization. The majority of frames have a near frontal pose, which is expected in naturalistic driving. Although there are comparatively fewer frames with non-frontal poses, these frames are crucial when studying driver distraction or maneuvering behavior.

Most HPE methods based on regular RGB images tend to fail in challenging conditions in naturalistic driving scenarios [17]. In these cases, the face may not even be detected. A benefit of our proposed point-cloud approach is that we can provide an estimation for all the frames. This is not the case for our baseline methods. In fact, OpenFace 2.0 and FAN do not provide results for 4.85% and 2.10% of the frames, respectively.

(a) Detection failure rate for OpenFace 2.0   (b) Detection failure rate for FAN

Fig. 5: Detection failure rate of OpenFace 2.0 and FAN as a function of yaw and pitch rotation angles. In contrast, the proposed approach provides an estimation for all the frames. The detection failure increases in bins with lighter colors. The bins with white color and no number indicate that no frames are available in the specific rotation range.

Figure 5 shows the percentage of frames in each bin for which the baseline methods failed to provide a prediction. The figure ignores bins with fewer than 10 frames. We observe that OpenFace 2.0 has a higher failure rate at large yaw and pitch angles. For some angle combinations, OpenFace 2.0 fails for all frames. While FAN has a lower detection failure rate overall, it also suffers in extreme head poses. Instead, our approach takes the raw data and directly predicts the head pose, obtaining an estimation for all the frames. This approach is practical in current naturalistic driving scenarios, where the presence of a person in the driver seat is expected when the car is moving.

Table II shows the mean square error (MSE) for our model and the baselines. The first row of the table provides the results for our proposed approach on all the frames in the test set. The MSEs for all the angular rotations are between 5° and 8°. These values show we can construct a robust HPE model using point-cloud data, which is particularly useful in challenging environments such as inside a vehicle.

We also compare the MSE of our method with OpenFace 2.0 and FAN. Since these algorithms do not provide predictions for all the frames, we define two subsets for fair comparisons with our method. We define the OpenFace set as the subset where OpenFace 2.0 provides a prediction. Likewise, we define the FAN set as the subset where FAN provides an estimate. Table II shows that our method provides competitive results compared to both baselines. In the OpenFace set, our approach reaches lower MSE for roll and pitch rotations. The proposed method is slightly worse for yaw rotations.


TABLE II: MSE of the proposed method using point-cloud data. The table also compares the HPE results with the results from OpenFace 2.0 and FAN using subsets of frames for which the baselines provided a prediction.

Method | Roll (°) | Yaw (°) | Pitch (°)
Proposed Point-Cloud Solution | 5.91 | 7.32 | 6.68

OpenFace Set
OpenFace 2.0 | 9.32 | 6.21 | 8.42
Proposed Point-Cloud Solution | 5.48 | 7.15 | 6.39

FAN Set
FAN | 11.30 | 19.28 | 8.47
Proposed Point-Cloud Solution | 5.66 | 7.24 | 6.53

It is important to highlight that OpenFace 2.0 does not provide estimations for extreme rotations, so the benefit of using our approach is not only better performance for most of the angular rotations, but also predictions for all the frames. In the FAN set, Table II clearly demonstrates the superior performance of our algorithm over FAN. Our model shows much lower MSE along all three rotation axes.

To better understand the performance of our proposed point-cloud approach and the baselines, we show the performance of each of these methods as a function of yaw (horizontal) and pitch (vertical) angles. Figure 6 shows the errors compared to the ground truth head pose labels obtained with Fi-Cap. The error is represented by the geodesic distance in the rotation matrix space [37]:

    Δ(R1, R2) = ‖ log(R1 R2^T) ‖                             (2)

The color of the bin indicates the error, where darker colors correspond to smaller errors. We ignore bins with fewer than ten frames in this analysis. Bins that are completely white indicate that there are not enough frames for these bins. Note that for our proposed method, the histogram includes frames where OpenFace 2.0 and FAN do not have predictions. Our method provides far better performance than the baselines in all frames, especially for frames with large rotations. While the baselines provide reasonably good performance for small rotations, their error grows rapidly as the ground truth rotation increases. This is particularly clear with FAN. OpenFace 2.0 provides very good rotation estimation when the overall rotation is small, but its performance drops significantly as the rotation increases. In contrast, the error in our approach grows much more slowly as the ground truth rotation increases, allowing a wider range of rotations. Overall, the proposed point-cloud framework provides far superior performance in frames with large rotations while maintaining competitive performance at small rotations.
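For reference, the NumPy sketch below computes the angle of the relative rotation R1 R2^T, which matches the geodesic distance in Eq. (2) up to the choice of norm; reporting the value in degrees is an assumption.

```python
import numpy as np

def geodesic_distance(R1, R2):
    """Angle (degrees) of the relative rotation R1 R2^T, cf. Eq. (2) and [37]."""
    relative = R1 @ R2.T
    cos_angle = np.clip((np.trace(relative) - 1.0) / 2.0, -1.0, 1.0)
    return np.degrees(np.arccos(cos_angle))
```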

VI. CONCLUSIONS

This paper presented a novel point-cloud based algorithm to estimate the head pose of a driver from a single depth camera. The proposed algorithm achieved promising performance, outperforming the state-of-the-art baselines OpenFace 2.0 and FAN, which rely on regular RGB images. The improvement in performance is especially clear on frames with large rotations. For automotive applications such as detecting driver distraction [7], [38], it is important to have accurate HPE on non-frontal poses.

(a) Proposed point-cloud approach   (b) OpenFace 2.0   (c) FAN

Fig. 6: Performance of the HPE algorithms as a function of yaw (horizontal) and pitch (vertical) angles. The graph shows the average geodesic distance between the true and predicted angles per bin. Bins with darker colors indicate lower errors. The bins with white color and no number indicate that no frames are available in the specific rotation range.

Our method is far more suitable for this purpose than the baselines. Furthermore, the proposed algorithm can also work in cases with very low illumination or even during the night. In these scenarios, RGB-based approaches cannot be used.

In the future, we would like to add temporal modeling to our framework to leverage the strong correlation between consecutive head poses. Our multimodal data is suitable for this task, since we have head rotation labels for all the frames in the corpus. We are also working to increase the size of the corpus by recording new subjects. We expect more robust and accurate models by increasing our training set, which will enable our model to better capture the facial variability across drivers.

REFERENCES

[1] E. Murphy-Chutorian, A. Doshi, and M. Trivedi, "Head pose estimation for driver assistance systems: A robust algorithm and experimental evaluation," in IEEE Intelligent Transportation Systems Conference (ITSC 2007), Seattle, WA, USA, September-October 2007, pp. 709–714.

[2] S. Ba and J.-M. Odobez, "Recognizing visual focus of attention from head pose in natural meetings," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 39, no. 1, pp. 16–33, February 2009.

[3] X. Zhang, Y. Sugano, M. Fritz, and A. Bulling, "Appearance-based gaze estimation in the wild," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), Boston, MA, USA, June 2015, pp. 4511–4520.

[4] S. Jha and C. Busso, "Analyzing the relationship between head pose and gaze to model driver visual attention," in IEEE International Conference on Intelligent Transportation Systems (ITSC 2016), Rio de Janeiro, Brazil, November 2016, pp. 2157–2162.

[5] ——, "Probabilistic estimation of the driver's gaze from head orientation and position," in IEEE International Conference on Intelligent Transportation (ITSC), Yokohama, Japan, October 2017, pp. 1630–1635.

[6] ——, "Probabilistic estimation of the gaze region of the driver using dense classification," in IEEE International Conference on Intelligent Transportation (ITSC 2018), Maui, HI, USA, November 2018, pp. 697–702.

[7] N. Li and C. Busso, "Predicting perceived visual and cognitive distractions of drivers with multimodal features," IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 1, pp. 51–65, February 2015.

[8] ——, "Detecting drivers' mirror-checking actions and its application to maneuver and secondary task recognition," IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 4, pp. 980–992, April 2016.

[9] G. Borghi, M. Venturelli, R. Vezzani, and R. Cucchiara, "POSEidon: Face-from-depth for driver pose estimation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, July 2017, pp. 5494–5503.

[10] A. Schwarz, M. Haurilet, M. Martinez, and R. Stiefelhagen, "DriveAHead - a large-scale driver head pose dataset," in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2017), Honolulu, HI, USA, July 2017, pp. 1165–1174.

[11] B. Raytchev, I. Yoda, and K. Sakaue, "Head pose estimation by nonlinear manifold learning," in IEEE International Conference on Pattern Recognition (ICPR 2004), vol. 4, Cambridge, UK, August 2004, pp. 462–466.

[12] G. Fanelli, J. Gall, and L. Van Gool, "Real time head pose estimation with random regression forests," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011), Providence, RI, USA, June 2011, pp. 617–624.

[13] Y. Yan, E. Ricci, R. Subramanian, O. Lanz, and N. Sebe, "No matter where you are: Flexible graph-guided multi-task learning for multi-view head pose classification under target motion," in IEEE International Conference on Computer Vision (ICCV 2013), Sydney, NSW, Australia, December 2013, pp. 1177–1184.

[14] S. S. Mukherjee and N. M. Robertson, "Deep head pose: Gaze-direction estimation in multimodal video," IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 2094–2107, November 2015.

[15] M. Venturelli, G. Borghi, R. Vezzani, and R. Cucchiara, "Deep head pose estimation from depth data for in-car automotive applications," in International Workshop on Understanding Human Activities through 3D Sensors (UHA3DS 2016), ser. Lecture Notes in Computer Science, H. Wannous, P. Pala, M. Daoudi, and F. Florez-Revuelta, Eds. Cancun, Mexico: Springer Berlin Heidelberg, December 2018, vol. 10188, pp. 74–85.

[16] Q. Ji and X. Yang, "Real-time eye, gaze, and face pose tracking for monitoring driver vigilance," Real-Time Imaging, vol. 8, no. 5, pp. 357–377, October 2002.

[17] S. Jha and C. Busso, "Challenges in head pose estimation of drivers in naturalistic recordings using existing tools," in IEEE International Conference on Intelligent Transportation (ITSC), Yokohama, Japan, October 2017, pp. 1624–1629.

[18] C. Qi, L. Yi, H. Su, and L. Guibas, "PointNet++: Deep hierarchical feature learning on point sets in a metric space," in Advances in Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, December 2017, pp. 5099–5108.

[19] S. Jha and C. Busso, "Fi-Cap: Robust framework to benchmark head pose estimation in challenging environments," in IEEE International Conference on Multimedia and Expo (ICME 2018), San Diego, CA, USA, July 2018, pp. 1–6.

[20] T. Baltrusaitis, A. Zadeh, Y. C. Lim, and L. Morency, "OpenFace 2.0: Facial behavior analysis toolkit," in IEEE Conference on Automatic Face and Gesture Recognition (FG 2018), Xi'an, China, May 2018, pp. 59–66.

[21] A. Bulat and G. Tzimiropoulos, "How far are we from solving the 2D & 3D face alignment problem? (and a dataset of 230,000 3D facial landmarks)," in IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, October 2017, pp. 1021–1030.

[22] E. Murphy-Chutorian and M. M. Trivedi, "Head pose estimation in computer vision: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 4, pp. 607–626, April 2009.

[23] B. Czuprynski and A. Strupczewski, "High accuracy head pose tracking survey," in International Conference on Active Media Technology (AMT 2014), ser. Lecture Notes in Computer Science, D. Slezak, G. Schaefer, S. Vuong, and Y. Kim, Eds. Warsaw, Poland: Springer Berlin Heidelberg, August 2014, vol. 8610, pp. 407–420.

[24] X. Liu, W. Liang, Y. Wang, S. Li, and M. Pei, "3D head pose estimation with convolutional neural network trained on synthetic images," in IEEE International Conference on Image Processing (ICIP 2016), Phoenix, AZ, USA, September 2016, pp. 1289–1293.

[25] G. P. Meyer, S. Gupta, I. Frosio, D. Reddy, and J. Kautz, "Robust model-based 3D head pose estimation," in IEEE International Conference on Computer Vision (ICCV 2015), Santiago, Chile, December 2015, pp. 3649–3657.

[26] T. Bar, J. Reuter, and J. Zollner, "Driver head pose and gaze estimation based on multi-template ICP 3-D point cloud alignment," in International IEEE Conference on Intelligent Transportation Systems (ITSC 2012), Anchorage, AK, USA, September 2012, pp. 1797–1802.

[27] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, "3D ShapeNets: A deep representation for volumetric shapes," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), Boston, MA, June 2015, pp. 1912–1920.

[28] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller, "Multi-view convolutional neural networks for 3D shape recognition," in IEEE International Conference on Computer Vision (ICCV 2015), Santiago, Chile, December 2015, pp. 945–953.

[29] C. Qi, H. Su, K. Mo, and L. J. Guibas, "PointNet: Deep learning on point sets for 3D classification and segmentation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, July 2017, pp. 77–85.

[30] Y. Zhou, C. Barnes, J. Lu, J. Yang, and H. Li, "On the continuity of rotation representations in neural networks," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, June 2019, pp. 5738–5746.

[31] P. Angkititrakul, M. Petracca, A. Sathyanarayana, and J. Hansen, "UTDrive: Driver behavior and speech interactive systems for in-vehicle environments," in IEEE Intelligent Vehicles Symposium, Istanbul, Turkey, June 2007, pp. 566–569.

[32] P. Angkititrakul, D. Kwak, S. Choi, J. Kim, A. Phucphan, A. Sathyanarayana, and J. Hansen, "Getting start with UTDrive: Driver-behavior modeling and assessment of distraction for in-vehicle speech systems," in Interspeech 2007, Antwerp, Belgium, August 2007, pp. 1334–1337.

[33] J. Hansen, C. Busso, Y. Zheng, and A. Sathyanarayana, "Driver modeling for detection and assessment of driver distraction: Examples from the UTDrive test bed," IEEE Signal Processing Magazine, vol. 34, no. 4, pp. 130–142, July 2017.

[34] E. Olson, "AprilTag: A robust and flexible visual fiducial system," in IEEE International Conference on Robotics and Automation (ICRA 2011), Shanghai, China, May 2011, pp. 3400–3407.

[35] T. Baltrusaitis, A. Zadeh, Y. C. Lim, and L. Morency, "Convolutional experts constrained local model for 3D facial landmark detection," in IEEE International Conference on Computer Vision Workshops (ICCVW 2017), Venice, Italy, October 2017, pp. 2519–2528.

[36] V. Ferrario, C. Sforza, G. Serrao, G. Grassi, and E. Mossi, "Active range of motion of the head and cervical spine: a three-dimensional investigation in healthy young adults," Journal of Orthopaedic Research, vol. 20, no. 1, pp. 122–129, January 2002.

[37] D. Huynh, "Metrics for 3D rotations: Comparison and analysis," Journal of Mathematical Imaging and Vision, vol. 35, no. 2, pp. 155–164, June 2009.

[38] J. Jain and C. Busso, "Analysis of driver behaviors during common tasks using frontal video camera and CAN-Bus information," in IEEE International Conference on Multimedia and Expo (ICME 2011), Barcelona, Spain, July 2011.