
Looking at Faces in a Vehicle: A Deep CNN Based Approach and Evaluation

Kevan Yuen, Sujitha Martin and Mohan M. Trivedi
University of California San Diego

[email protected], [email protected], [email protected]

Abstract— The driver's face is key to less intrusive methods of monitoring the driver to derive information such as distraction, drowsiness, intent, and where they are looking. A vital step in extracting this higher level information is to find the driver's face and individual components such as the eyes, nose and mouth, along with the direction the driver is facing. In a safety-critical context like driving, it is important that this vital step be robust; otherwise the higher level information is unavailable or unreliable. The lighting condition of a driver's cabin varies greatly, from dark when driving under a bridge or in a parking structure to extremely bright on a sunny day. Various occlusions of the face may also occur due to hand activities such as drinking water or making gestures. This work introduces a system based on existing CNN structures, with slight modifications, that implements face detection, landmark localization, and landmark-based head pose estimation, addressing the challenges found in the driver cabin. To handle these challenges, training samples are artificially augmented for the purpose of developing a system robust in the environment of a vehicle.

I. INTRODUCTION

Major factors in vehicular accidents are driver distraction, drowsiness, and inattention [1]. From 2005 through 2009, an estimated annual average of 886 fatal crashes and 37,000 injury crashes involved drowsy driving [2]. At least 3,000 people were killed and 380,000 injured annually in crashes involving distracted drivers from 2010 through 2014 [2][3][4][5][6]. In active safety systems for intelligent vehicles assisting the driver, monitoring the face can help prevent these types of accidents: by tracking head pose and gaze, the system can warn the driver ahead of time about other cars or pedestrians they have not seen; by monitoring the eyes and mouth, it can detect drowsiness through indicators such as yawning or PERCLOS, which was found to be significantly correlated with driver fatigue [7].

In order to extract higher level information, such as where the driver's attention is focused or whether the driver is drowsy, a reliable but non-intrusive system is required to detect the driver's face, locate various landmarks around the face such as the eyes and mouth in detail, and estimate head pose to determine an approximate direction the driver is looking. Developing such a system paves the path towards working on higher level information in future research. In this paper, a modular framework is presented to locate a face in an image and localize facial landmarks for estimating head pose. Only a limited amount of annotated data for face location and landmarks is publicly available, and these datasets generally contain well-lit or posed scenes with minimal occlusion of the face. Such datasets are not representative of the real-world challenges found on the road, with harsh lighting and occlusions. Our contributions are two-fold: 1) the augmentation of training samples to robustly perform face detection and landmark localization under varying occlusion, and 2) cascading the AlexNet and Stacked Hourglass networks, which are very successful in their own domains, and applying them to faces in the car. Quantitative evaluation shows the success of our system on a test set composed of naturalistic driving data for both face detection and head pose estimation.

II. RELATED STUDIES

Recent deep convolutional neural network (DCNN) face detectors have achieved top performing results, such as DDFD [8], Faceness [9], CascadeCNN [10], and DenseBox [11]. DDFD fine-tunes AlexNet [12], a well known CNN structure, to train a face detector by generating a score image and extracting detected faces. Faceness trains multiple DCNN components to detect individual parts of the face, e.g. eyes, nose, mouth, and merges their responses to find a face. CascadeCNN achieves faster run times by cascading multiple CNNs: the early stages quickly reject non-face background areas with fewer parameters, while the later stages have increasing complexity to reject more difficult false positives.

DenseBox introduced an end-to-end FCN framework directly predicting bounding boxes and scores, and showed that accuracy can be improved by incorporating landmark localization into the network. Another relevant work is the Stacked Hourglass network [13], a very dense network used to predict body joint locations given the location of a pedestrian. Our work is mainly composed of two CNN systems, following the idea of CascadeCNN by cascading two different networks. The first stage is an AlexNet face detector, following closely the work of DDFD. The Stacked Hourglass is used as the second stage, as Faceness and DenseBox found success in face detection by incorporating facial part responses.



[Fig. 1 diagram, "Proposed System": Pre-processed Input Image → Face Detector Module (Network Structure: AlexNet FCN) → Face Box & Score → Landmark Module (Network Structure: Stacked Hourglass) → 68 Landmark Locations → Head Pose Estimator (Method: POSIT) → Head Pose (Pitch, Yaw, Roll)]

Fig. 1: Our proposed CNN system using two different CNN structures for detecting a face, localizing landmarks and estimating head pose. The input image is preprocessed with contrast-limited adaptive histogram equalization and fed into the face detector module to output face detections. The detected faces are processed by the landmark module to estimate landmark locations, re-estimate the face detection's score to reject false positives, and re-localize the bounding box of the face for better overlap. To estimate head pose, eight rigid landmarks (see red dots on the face model) are chosen to estimate the pose using a 3D generic face model and the POSIT algorithm.

III. PROPOSED METHOD

In this work, existing CNN structures are modified and trained to predict face location, landmarks and head pose. For face detection, we train the AlexNet structure to detect faces as described in our other work [14], similar to DDFD. To refine the detected face's location and score, and also to estimate head pose, we train a stacked hourglass network with intermediate supervision. The face location can be refined by generating a tighter bounding box around the face using the locations of the estimated landmarks. The hourglass network output also provides scoring information that allows us to compute a refined score for the detected face to further reject false positives. Head pose is estimated with the POSIT algorithm [15] using a 3D generic face [16] and selected rigid landmarks. The proposed system is shown in Fig. 1. For our application in the vehicle domain, where driving safety requires coping with harsh lighting and occlusions, it is critical that our system continues monitoring the driver's face and behavior even under these situations. Therefore, one of the key contributions of this work is heavily augmenting the training samples to include more examples of faces under harsh lighting and occlusion, exploring improvements for handling these scenarios in both face detection and landmark localization.

A. Training Sample Generation

The AFLW dataset [17] provides 25,000 faces annotated with boxes, which we use for training our face detector. Although the AFLW dataset does provide landmark annotations, occluded landmarks were not annotated at all. Instead, we opted to use a smaller dataset of about 3,400 faces, but with more landmarks and with annotations provided even under occlusion. This dataset is composed of images from new and existing datasets such as LFPW [18], HELEN [19], AFW [20], and the 300-W face datasets, with landmarks provided by the 300-W challenge [21][22][23]. The annotations provided with the 300-W datasets do not attempt to cover every face in the image, so AFLW is used instead to generate the negative samples for landmark training, as we found most, if not all, of its faces are annotated.

Since this dataset contains faces that are generally well-lit or minimally occluded, lighting and occlusion augmentations are applied to the samples to develop a system more robust to these conditions. Due to limited space, we briefly describe the overall procedure for generating training samples for landmark training, and refer the reader to our previous work on face detector training for more details [14]. Random square windows are sampled from the training images and are classified as positive or negative samples if their IOU (Intersection Over Union) with an annotated face is greater than 70% or less than 5%, respectively. The windows are then randomly rotated between -15 and 15 degrees, as faces may be slightly rotated in real data. We chose to start our experiments on the safe side, with a lower IOU threshold for the negative sample requirement and smaller rotation augmentation compared to the higher values used in our previous work on training the face detector.
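As an illustration, the following is a minimal Python sketch of this sampling procedure, assuming face annotations as (x1, y1, x2, y2) boxes; the window size, the handling of windows with ambiguous overlap, and the image being larger than the window are assumptions of the sketch, not values from the paper.

import random
import numpy as np
from scipy.ndimage import rotate

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def sample_window(image, face_boxes, size=256):
    """Draw one random square window and label it, or reject it as ambiguous."""
    h, w = image.shape[:2]           # assumes h, w >= size
    x = random.randint(0, w - size)
    y = random.randint(0, h - size)
    window = (x, y, x + size, y + size)
    best = max((iou(window, fb) for fb in face_boxes), default=0.0)
    if best > 0.70:
        label = 1                    # positive: mostly contains an annotated face
    elif best < 0.05:
        label = 0                    # negative: essentially background
    else:
        return None                  # ambiguous overlap: discard (assumption)
    crop = image[y:y + size, x:x + size]
    angle = random.uniform(-15, 15)  # small in-plane rotation, per the text
    crop = rotate(crop, angle, reshape=False, mode='nearest')
    return crop, label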

For occlusion augmentations, we use the SUN2012 dataset [24], containing over 250,000 pixel-labeled regions and objects. The large variety of objects helps prevent the networks from overfitting to the occluding objects themselves. Objects are randomly selected from the dataset and placed on top of the face sample to mimic the effects of occlusion. More details are provided in our face detection paper. It was found in our previous experiments that applying contrast-limited adaptive histogram equalization (CLAHE) on the luma channel in YCrCb space helps mitigate the effects of varying lighting conditions, so we apply this algorithm to all our samples so that the network is trained to classify histogram-equalized images.
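For illustration, a minimal sketch of this preprocessing step using OpenCV; the clip limit and tile size are illustrative defaults, not values reported in the paper.

import cv2

def apply_clahe(image_bgr):
    """Equalize the luma (Y) channel in YCrCb space, leaving chroma untouched."""
    ycrcb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2YCrCb)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))  # illustrative settings
    ycrcb[:, :, 0] = clahe.apply(ycrcb[:, :, 0])  # channel 0 is luma in YCrCb
    return cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)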

For face detection training, the target labels are simply a binary value of 1 for face or 0 for non-face. The target labels of the stacked hourglass network are 68 score or heat map images indicating the location of each of the 68 facial landmarks. For negative samples, all 68 score images are filled with zeros. For positive samples, each score image is a 2-D Gaussian given by eq. 1, where the mean (x_LM, y_LM) is the location of the landmark and the standard deviation is σ = 1.5:

G(x, y) = \exp\left(-\left(\frac{(x - x_{LM})^2}{2\sigma^2} + \frac{(y - y_{LM})^2}{2\sigma^2}\right)\right)    (1)
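A minimal sketch of this target generation in Python: one heat map per landmark with a 2-D Gaussian (σ = 1.5) centered at the landmark, all zeros for negative samples. The 64x64 output resolution is an assumption here, not a value stated in the paper.

import numpy as np

def make_targets(landmarks, size=64, sigma=1.5):
    """landmarks: 68 (x, y) points in heat map coordinates, or None for a negative."""
    targets = np.zeros((68, size, size), dtype=np.float32)
    if landmarks is None:            # negative sample: all score images stay zero
        return targets
    ys, xs = np.mgrid[0:size, 0:size]
    for i, (x_lm, y_lm) in enumerate(landmarks):
        # eq. (1): 2-D Gaussian centered at the landmark location
        targets[i] = np.exp(-((xs - x_lm) ** 2 + (ys - y_lm) ** 2) / (2 * sigma ** 2))
    return targets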

B. Model Training

The AlexNet structure is modified to output 2 class probabilities, face or non-face, instead of the original 1000 classes. The fully connected layers are converted to convolutional layers so that AlexNet can generate a probability or heat map for extracting the location of a face from a larger image, as done in DenseNet [25]. Training was done with Caffe [26] using stochastic gradient descent and 256 images per batch: 70 positive samples and 186 negative samples. For landmark estimation, we use the two-stack hourglass model, as described in the original paper, with the output feature dimension changed to 68 landmarks. The network is trained with Torch [27] using the publicly available code for approximately 300,000 iterations with a mini batch size of 6.
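The fully convolutional conversion can be illustrated as follows. This is a hedged PyTorch sketch of the general technique (the paper performs the equivalent surgery in Caffe, which is not shown here), assuming a fully connected layer whose input was a 6x6x256 feature map, as in AlexNet's fc6.

import torch.nn as nn

def fc_to_conv(fc, in_channels=256, spatial=6):
    """Reinterpret nn.Linear weights as an equivalent nn.Conv2d.

    The resulting convolution produces the same output as the original fully
    connected layer on a spatial x spatial input, but can also slide over
    larger inputs to emit a score map.
    """
    conv = nn.Conv2d(in_channels, fc.out_features, kernel_size=spatial)
    conv.weight.data = fc.weight.data.view(
        fc.out_features, in_channels, spatial, spatial)
    conv.bias.data = fc.bias.data
    return conv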

C. Testing Phase

The entire image is passed through the AlexNet structure, which outputs a heat map of detected faces; this map is processed using heuristic methods, described in our previous work, to generate bounding boxes and scores. To handle faces at different scales, an image pyramid is generated. All detected faces are cropped out from the original image, pre-processed, and fed through the hourglass network to extract a set of score images. Each detection passed through the network yields 68 score images corresponding to the 68 landmarks, visualized in Fig. 2. The score image for a given landmark is thresholded at 0.6 times the highest value in the image to generate a binary image, and connected component labeling is applied (e.g. MATLAB's bwlabel function) to group clusters of pixels together. The location of the landmark is then computed as the weighted centroid of the group of pixels with the highest value in the score image (e.g. MATLAB's regionprops function). This is repeated for the remaining landmarks. The detection box is then re-localized based on the minimum and maximum x-y coordinates of all landmarks, extending the top of this box by 20% so that the forehead area of the face is included.
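A minimal Python sketch of this landmark extraction step, substituting scipy analogues for MATLAB's bwlabel and regionprops; the threshold factor of 0.6 follows the text.

import numpy as np
from scipy import ndimage

def extract_landmark(score_image, thresh_factor=0.6):
    """Return the (x, y) weighted centroid of the strongest pixel cluster."""
    binary = score_image >= thresh_factor * score_image.max()
    labels, n = ndimage.label(binary)          # bwlabel analogue
    if n == 0:
        return None
    # pick the connected component containing the global maximum of the map
    peak = np.unravel_index(np.argmax(score_image), score_image.shape)
    mask = labels == labels[peak]
    cy, cx = ndimage.center_of_mass(score_image * mask)  # weighted centroid
    return cx, cy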

Fig. 2: (a) Example output face from the face detector, and (b) the resulting score image output from the Stacked Hourglass network. Note that the 68 score images are combined into a single image for visualization purposes only; each score image is ideally a single 2-D Gaussian located at its landmark.

To refine the face detection score using landmark information, an ideal ground truth image is generated by placing a 2-D Gaussian at the predicted location of each landmark, following the ground truth sample generation described earlier. The score of the i-th landmark, S_LMi, is computed from how much the estimated score image deviates from its ideal ground truth image, using the sum of squared errors as the measure, as given by eq. 2:

S_{LM_i} = -\sum_x \sum_y \left( |I_{S_{LM_i}(ideal)}(x, y)| - |I_{S_{LM_i}(est)}(x, y)| \right)^2    (2)

where |I_{S_{LM_i}(est)}(x, y)| corresponds to the pixel at location (x, y) in the estimated score image for a given landmark i, and |I_{S_{LM_i}(ideal)}(x, y)| represents the same information in the generated ideal ground truth image. A low scoring landmark has a large negative value, while a high scoring landmark has a smaller negative value closer to zero. The refined face detector score S_RFD is computed as the sum of the individual landmark scores, given by eq. 3:

S_{RFD} = \sum_{i=1}^{68} S_{LM_i}    (3)
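A minimal sketch of this score refinement in Python: each estimated score image is compared against an ideal Gaussian rendered at the predicted landmark location, and the negated squared errors are summed over all 68 landmarks. This reuses the make_targets generator sketched earlier and assumes square heat maps.

import numpy as np

def refined_score(est_heatmaps, pred_landmarks, sigma=1.5):
    """est_heatmaps: (68, S, S) network output; pred_landmarks: 68 (x, y) points."""
    size = est_heatmaps.shape[1]
    ideal = make_targets(pred_landmarks, size=size, sigma=sigma)  # ideal ground truth
    # eq. (2): per-landmark negated sum of squared deviations
    s_lm = -np.sum((np.abs(ideal) - np.abs(est_heatmaps)) ** 2, axis=(1, 2))
    # eq. (3): refined face detector score is the sum over all landmarks
    return float(np.sum(s_lm))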

The pose is estimated using the POSIT algorithm [15] with a 3D generic face [16] and a subset of the estimated landmark positions. We chose to use eight rigid landmarks: two eye corners from each eye, the nose tip, two nose corners, and the chin. Non-rigid landmarks such as the mouth corners, which may move while the person is talking or yawning, are not used for pose estimation, as they deviate from the 3D generic face model much more, leading to higher estimation error.
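The paper uses POSIT [15] with the generic 3D face model of [16]; as an illustrative stand-in, OpenCV's solvePnP recovers a head rotation from the same 2-D/3-D correspondences. The 3-D coordinates below are placeholder values for the rigid landmarks, not the model of [16], and the Euler-angle sign conventions depend on the chosen axes.

import cv2
import numpy as np

MODEL_3D = np.array([                  # hypothetical generic-face coordinates (mm)
    [-45, 35, -25], [-15, 35, -20],    # left eye outer/inner corner
    [ 15, 35, -20], [ 45, 35, -25],    # right eye inner/outer corner
    [-15, -5, -10], [ 15, -5, -10],    # nose corners
    [  0,  0,   0],                    # nose tip (model origin)
    [  0, -65, -15],                   # chin
], dtype=np.float64)

def head_pose(rigid_landmarks_2d, focal_length, image_center):
    """Return (pitch, yaw, roll) in degrees from eight rigid 2-D landmarks."""
    camera = np.array([[focal_length, 0, image_center[0]],
                       [0, focal_length, image_center[1]],
                       [0, 0, 1]], dtype=np.float64)
    ok, rvec, _tvec = cv2.solvePnP(
        MODEL_3D, np.asarray(rigid_landmarks_2d, np.float64), camera, None)
    rot, _ = cv2.Rodrigues(rvec)
    # standard ZYX Euler decomposition of the rotation matrix
    pitch = np.degrees(np.arctan2(rot[2, 1], rot[2, 2]))
    yaw = np.degrees(np.arctan2(-rot[2, 0], np.hypot(rot[2, 1], rot[2, 2])))
    roll = np.degrees(np.arctan2(rot[1, 0], rot[0, 0]))
    return pitch, yaw, roll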

IV. PERFORMANCE EVALUATION ON NATURALISTIC DRIVING DATA

The evaluation is done on the VIVA-Face dataset [28], a face detection and head pose challenge featuring harsh lighting conditions and facial occlusions due to objects such as hands or the vehicle's sun visor. It comprises images sampled from 39 video sequences, drawn from both daytime naturalistic driving data recorded at LISA-UCSD and, to the best of our knowledge, naturalistic driving videos collected from YouTube, providing a wide variety of camera placements and up to 4 faces per image. The challenge is evaluated on three different levels based on occlusion. Level-0 (L0) contains all faces, both non-occluded and occluded (607 faces in total); this is the standard evaluation metric adopted in the literature, with no separation of faces into categories. L0 is then broken down into two disjoint sets based on occlusion: Level-1 (L1), faces with no occluded face parts as defined in the challenge (323 faces), and Level-2 (L2), faces with at least one face part occluded (284 faces).

Fig. 3: Example output of our system showing the original face detection output (red dashed line) using the AlexNet network, 68-landmark estimation (white dots) and refined face detection localization (solid white line) using the Stacked Hourglass network, and head pose (blue, green, and red solid lines, corresponding to unit vectors coming out of the head from the front, top, and left, respectively). The first three rows show good example outputs of our system on landmark localization and head pose estimation, and the last row shows bad examples.

Face detectors are evaluated with the well-established precision-recall curve using the standard PASCAL overlap requirement of 50%, shown in Fig. 4. Alongside the two face detectors trained in our previous work (Our_method_no_occ and Our_method_occ), baseline methods such as the boosted cascade with Haar features [29], the boosted cascade with ACF [30] and the Mixture of Tree Structures [31] are shown for comparison. For evaluating our new proposed system (Our_method_hourglass), a slight improvement was found by feeding face candidate windows from both of our previously trained AlexNet face detectors into the landmark module, as opposed to just one of them, in the spirit of ensemble methods. This slight improvement is due to the different faces that the two detectors are able to recall, while the landmark module acts as a second stage that uses landmark information to reject false positives. The curves show that the precision issue of the model trained with occlusion augmentation in our previous work, Our_method_occ, is resolved after refining the face detector scores with the landmark module. Fig. 4a shows that the refined face detector improved the system by rejecting false positives and localizing faces better with the new landmark model. Fig. 4b reveals that it is near perfect at detecting non-occluded faces, while Fig. 4c shows that it is top performing, though there remains considerable room for improvement in handling occluded faces.
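For reference, a minimal sketch of the PASCAL-style matching underlying such precision-recall curves: a detection counts as a true positive if it overlaps a not-yet-matched ground truth box by at least 50% IOU. This is a generic illustration of the protocol, not the VIVA evaluation code.

def match_detections(detections, gt_boxes, min_iou=0.5):
    """detections: (box, score) pairs sorted by descending score; boxes are (x1, y1, x2, y2)."""
    def iou(a, b):
        ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0
    matched, tp, fp = set(), 0, 0
    for box, _score in detections:
        cands = [(iou(box, g), j) for j, g in enumerate(gt_boxes) if j not in matched]
        best, j = max(cands, default=(0.0, -1))
        if best >= min_iou:
            matched.add(j)   # each ground truth face may be matched only once
            tp += 1
        else:
            fp += 1
    return tp, fp, len(gt_boxes) - len(matched)  # true pos., false pos., misses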


[Fig. 4 plots: precision (y-axis) vs. recall (x-axis), panels (a) L0 (all faces), (b) L1 (no parts occluded), (c) L2 (at least one part occluded). Legend APs: (a) Our_method_hourglass 91.2, Anonymous_1 88.3, Our_method_no_occ 85.8, Our_method_occ 85.4, Zhu_MultiPIE_Ind 58.5, ACF 53.1, Viola_Jones; (b) Our_method_hourglass 99.3, Anonymous_1 94.8, Our_method_no_occ 93.6, Our_method_occ 88.6, ACF 66.1, Zhu_MultiPIE_Ind 62.6, Viola_Jones; (c) Our_method_hourglass 80.6, Our_method_occ 80.6, Anonymous_1 78.7, Our_method_no_occ 75.8, Zhu_MultiPIE_Ind 52.1, ACF 35.6, Viola_Jones.]

Fig. 4: Precision-Recall curves of the proposed method and baseline methods on the VIVA dataset.

TABLE I: Evaluation results from benchmark and submissions on yaw angle estimation of the head pose. The evaluation is split by occlusion levels: L0 (all 607 faces in 458 images), L1 (323 non-occluded faces in 289 images), L2 (284 faces with at least one occluded part in 240 images). DR, detection rate, is the percentage of images for which the face detector was able to detect a face with at least 50% overlap with the ground truth face. SR, success rate, is the percentage of correctly detected faces for which the estimated yaw angle was within 15 degrees of the annotation. µAE and σAE are the mean and standard deviation of the absolute yaw error (in degrees), respectively, calculated only from the correctly detected faces.

                      |       L0 (all faces)        |   L1 (no parts occluded)    | L2 (at least one part occluded)
Benchmark/Submission  |  DR     SR     µAE    σAE   |  DR     SR     µAE    σAE   |  DR     SR     µAE    σAE
Our method hourglass  | 93.2%  93.4%   5.6°  10.0°  | 99.7%  98.6%   3.6°   3.4°  | 86.1%  86.9%   8.2°  14.2°
Anonymous 4           | 60.8%  66.9%  13.9°  12.6°  | 72.4%  68.6%  13.6°  12.8°  | 48.1%  64.1%  14.4°  12.3°
Zhu MultiPIE Ind      | 67.3%  63.1%  16.0°  16.5°  | 70.0%  68.5%  14.0°  13.5°  | 64.2%  56.7%  18.3°  19.1°


Head pose estimation is evaluated on the same levels of the VIVA-Face dataset: L0, L1 and L2. Evaluation results and metrics are shown in Table I. Our system achieves high detection rates of at least 86.1% across all levels, and estimates the yaw angle of the head pose with a mean absolute error of at most 8.2° per level. Examples of our proposed system's face detection, landmark localization and pose estimation under the various lighting, pose, and occlusion situations found in naturalistic driving are shown in Fig. 3.

V. CONCLUDING REMARKS

In this paper, a system is introduced that proposes initial face detections using the AlexNet structure and passes them through a second stage of refinement using the Stacked Hourglass network to estimate landmarks and refine the face detection localization and scores. It was shown that the second stage refinement significantly improved precision over the stage-one face detectors. Using the predicted rigid landmarks, the system is able to provide quite accurate estimates of the head pose. The networks were trained with augmented training samples specifically to handle the harsh lighting and heavy occlusions found in naturalistic driving data. This work provides a strong basis for the development of higher level facial analysis of the driver's state and focus, such as distraction, drowsiness, emotions and gaze zone estimation.

VI. ACKNOWLEDGMENTS

We thank the reviewers and editors for their valuable comments while preparing the manuscript, and thank our colleagues at the Computer Vision and Robotics Research Laboratory for their assistance.

REFERENCES

[1] S. G. Klauer, F. Guo, J. Sudweeks, and T. A. Dingus, "An analysis of driver inattention using a case-crossover approach on 100-car data: Final report," Tech. Rep., 2010.

[2] NHTSA, "Traffic safety facts: Drowsy driving," https://crashstats.nhtsa.dot.gov/Api/Public/Publication/811449, accessed: 2016-08-10.

[3] ——, "Traffic safety facts: Distracted driving 2011," http://www.distraction.gov/downloads/pdfs/traffic-safety-facts-04-2013.pdf, accessed: 2016-08-10.

[4] ——, "Traffic safety facts: Distracted driving 2012," http://www.distraction.gov/downloads/pdfs/812012.pdf, accessed: 2016-08-10.

[5] ——, "Traffic safety facts: Distracted driving 2013," http://www.distraction.gov/downloads/pdfs/Distracted Driving 2013 Research note.pdf, accessed: 2016-08-10.

[6] ——, "Traffic safety facts: Distracted driving 2014," https://crashstats.nhtsa.dot.gov/Api/Public/Publication/812260, accessed: 2016-08-10.

[7] Z. Pei, S. Zhenghe, and Z. Yiming, "PERCLOS-based recognition algorithms of motor driver fatigue," Journal-China Agricultural University, vol. 7, no. 2, pp. 104–109, 2002.

[8] S. S. Farfade, M. Saberian, and L.-J. Li, "Multi-view face detection using deep convolutional neural networks," arXiv preprint arXiv:1502.02766, 2015.

[9] S. Yang, P. Luo, C. C. Loy, and X. Tang, "From facial parts responses to face detection: A deep learning approach," in International Conference on Computer Vision (ICCV), 2015.

[10] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, "A convolutional neural network cascade for face detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[11] L. Huang, Y. Yang, Y. Deng, and Y. Yu, "DenseBox: Unifying landmark localization with end to end object detection," arXiv preprint arXiv:1509.04874, 2015.


[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012.

[13] A. Newell, K. Yang, and J. Deng, "Stacked hourglass networks for human pose estimation," arXiv preprint arXiv:1603.06937, 2016.

[14] K. Yuen, S. Martin, and M. M. Trivedi, "On looking at faces in an automobile: Issues, algorithms and evaluation on naturalistic driving dataset," in International Conference on Pattern Recognition (ICPR 2016). IEEE, 2016.

[15] D. F. Dementhon and L. S. Davis, "Model-based object pose in 25 lines of code," International Journal of Computer Vision, vol. 15, no. 1-2, pp. 123–141, 1995.

[16] S. Martin, A. Tawari, E. Murphy-Chutorian, S. Y. Cheng, and M. Trivedi, "On the design and evaluation of robust head pose for visual user interfaces: Algorithms, databases, and comparisons," in Proceedings of the 4th International Conference on Automotive User Interfaces and Interactive Vehicular Applications. ACM, 2012, pp. 149–154.

[17] M. Kostinger, P. Wohlhart, P. M. Roth, and H. Bischof, "Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization," in IEEE International Conference on Computer Vision Workshops, 2011.

[18] P. N. Belhumeur, D. W. Jacobs, D. J. Kriegman, and N. Kumar, "Localizing parts of faces using a consensus of exemplars," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 35, no. 12, pp. 2930–2940, 2013.

[19] V. Le, J. Brandt, Z. Lin, L. Bourdev, and T. S. Huang, "Interactive facial feature localization," in Computer Vision–ECCV 2012. Springer, 2012, pp. 679–692.

[20] X. Zhu and D. Ramanan, "Face detection, pose estimation, and landmark localization in the wild," in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2879–2886.

[21] C. Sagonas, E. Antonakos, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic, "300 faces in-the-wild challenge: Database and results," Image and Vision Computing, vol. 47, pp. 3–18, 2016.

[22] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic, "A semi-automatic methodology for facial landmark annotation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2013, pp. 896–903.

[23] ——, "300 faces in-the-wild challenge: The first facial landmark localization challenge," in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2013, pp. 397–403.

[24] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, "SUN database: Large-scale scene recognition from abbey to zoo," in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 3485–3492.

[25] F. Iandola, M. Moskewicz, S. Karayev, R. Girshick, T. Darrell, and K. Keutzer, "DenseNet: Implementing efficient convnet descriptor pyramids," arXiv preprint arXiv:1404.1869, 2014.

[26] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," arXiv preprint arXiv:1408.5093, 2014.

[27] R. Collobert, K. Kavukcuoglu, and C. Farabet, "Torch7: A Matlab-like environment for machine learning," in BigLearn, NIPS Workshop, no. EPFL-CONF-192376, 2011.

[28] S. Martin, K. Yuen, and M. M. Trivedi, "Vision for intelligent vehicles & applications (VIVA): Face detection and head pose challenge," in 2016 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2016, pp. 1010–1014.

[29] P. Viola and M. J. Jones, "Robust real-time face detection," International Journal of Computer Vision, 2004.

[30] P. Dollar, R. Appel, S. Belongie, and P. Perona, "Fast feature pyramids for object detection," Pattern Analysis and Machine Intelligence, IEEE Transactions on, 2014.

[31] X. Zhu and D. Ramanan, "Face detection, pose estimation, and landmark localization in the wild," in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012.
