
Gaze Fixations and Dynamics for Behavior Modeling and Prediction of On-road Driving Maneuvers

Sujitha Martin and Mohan M. Trivedi

Abstract— From driver assistance in manual mode to take-over requests in highly automated mode, knowing the state of the driver (e.g. sleeping, distracted, attentive) is critical for safe, comfortable and stress-free driving. Since driving is a visually demanding task, the driver's gaze is especially important in estimating the state of the driver; it has the potential to reveal what the driver has attended to or is attending to, and to predict future actions. We developed a machine vision based framework to model driver behavior by representing the gaze dynamics over a time period using gaze fixations and transition frequencies. As a use case, we explore the driver's gaze patterns during maneuvers executed in freeway driving, namely, left lane change, right lane change and lane keep. It is shown that mapping gaze dynamics to gaze fixations and transition frequencies leads to recurring patterns tied to driver activities. Furthermore, using data from on-road driving, we show that models of these patterns have predictive power for on-road maneuver detection a few hundred milliseconds a priori.

I. INTRODUCTION

Intelligent vehicles of the future are those which, having a holistic perception (i.e. inside, outside and of the vehicle) and understanding of the driving environment, make it possible for occupants to go from point A to point B safely, comfortably and in a timely manner [1], [2]. This may happen with the human driver in full control and getting active assistance from the robot, or with the robot in partial or full control and the human driver a passive observer ready to take over as deemed necessary by the machine or the human [3], [4]. Therefore, the future of intelligent vehicles lies in the collaboration of two intelligent systems, one robot and the other human.

The driver's gaze is of particular interest because whether and how the driver is monitoring the driving environment is vital for driver assistance in manual mode and for take-over requests in highly automated mode, toward safe, comfortable and stress-free driving [5], [6]. The literature has addressed the problem of estimating the driver's awareness of the surround in a few different ways. Ahlstrom et al. [7], under the assumption that the driver's attention is directed to the same object as the gaze, developed a rule-based 2-second 'attention buffer' which depletes when the driver looks away from the field relevant to driving (FRD) and starts filling up when the gaze is redirected toward the FRD. One of the reasons for a 2-second buffer is that taking the eyes off the road for more than 2 seconds increases the risk of a collision to at least two times that of normal, baseline

The authors are with the Laboratory for Intelligent and Safe Automobiles, University of California San Diego, La Jolla, CA 92093 USA (see http://cvrr.ucsd.edu/).

Fig. 1. Illustrates context based gaze fixations and transition frequencies of interest for modeling driver behavior and predicting driver state/action.

driving [8]. Tawari et al. [9], on the other hand, developed a framework for estimating the driver's focus of attention by simultaneously observing the driver and the driver's field of view. Specifically, that work proposed to associate coarse eye position with saliency of the scene to understand what object the driver is focused on at any given moment. Li et al. [10], under the assumption that mirror-checking behaviors are strongly correlated with the driver's situation awareness, showed that the frequency and duration of mirror checking are reduced during secondary task performance versus normal, baseline driving. Furthermore, mirror-checking actions were used as features (e.g. binary labels indicating presence of mirror checking, frequency and duration of mirror checking) in driving classification problems (i.e. normal versus task/maneuver recognition). However, that classification problem had as its input features from CAN signals and external cameras, whereas the classification and recognition problem addressed in this work is based purely on looking at the driver's gaze. Work by Birrell and Fowkes [11] is most similar to ours in its use of glance durations and transition frequencies; however, it differs in its definition of the representation, in its focus on the effects of using an in-vehicle smart driving aid, and in its lack of predictive modeling.

In this work, we develop a vision-based system to model driver behavior and predict maneuvers from gaze dynamics alone. One of the key aspects of modeling driver behavior is the representation of driver gaze dynamics using gaze fixations and transition frequencies; this can be likened to smoothing out high-frequency noise while retaining the fundamental information. For example, when interacting with the center stack (e.g. radio, AC, navigation), one driver may perform the secondary task with one long glance away from the forward driving direction toward the center stack; another possibility is multiple short glances towards the center stack. While such individual gaze patterns differ, together they are associated with one action of interest and therefore share common characteristics at some level of abstraction.


TABLE I
DESCRIPTION OF ANALYZED ON-ROAD DRIVING DATA

            Duration            No. of Events
Trip No.    Full drive [min]    Left Lane Change    Right Lane Change    Lane Keeping
1           81.2                11                  16                   104
2           88.5                14                  15                   92
3           89.6                16                  17                   112
4           93.9                14                  15                   179
All         353.3               55                  49                   487

The contributions of this work are twofold. First is the automatic gaze dynamics analyzer, which takes a video sequence as input and outputs context-based gaze fixations and transition frequencies (as illustrated in Fig. 1). Second is the modeling of gaze dynamics for activity prediction. Lastly, using data from on-road driving, we demonstrate both quantitatively, showing the ability to predict maneuvers reliably a few hundred milliseconds in advance.

II. NATURALISTIC DRIVING DATASET

A large corpus of naturalistic driving data was collected using subjects' personal vehicles over the span of six months. Each vehicle is instrumented with four GoPro Hero4 cameras: two looking at the driver's face, one looking at the driver's hands and one looking at the forward driving direction. In this study, the focus is on analyzing the driver's face, while the forward view provides context for data mining; the hand-looking camera is included for future studies of holistic modeling of driver behavior.

All cameras are configured to capture data at 1080p resolution and 30 frames per second (fps). With identical camera configurations and similar camera placements, the subjects captured data in their instrumented personal vehicles during their long commutes. The drives consisted of some driving in urban settings, but mostly freeway settings with anywhere from two to six lanes. The data was deliberately collected in the subjects' personal vehicles in order to retain natural interaction with vehicle accessories, and during the subjects' usual commutes in order to study driver behavior under some constraints (e.g. familiar travel routes) with certain variables (e.g. traffic conditions, time of day/week/month).

This study analyzes data from a subset of the large corpus, namely four drives, whose details are given in Table I. From the collected data, with a special focus on freeway settings, events were selected in which the driver executes a lane change maneuver, changing to either the left or the right lane, and in which the driver keeps the lane. Table I shows the events considered and their respective counts.

III. METHODOLOGY

This section describes a vision based system to model driver behavior from gaze dynamics and predict maneuvers. There are three main components: gaze zone estimation, gaze dynamics representation and gaze-based behavior modeling.

A. Gaze-zone Estimation

Two popular approaches to gaze zone estimation are an end-to-end CNN [12] and building blocks leading up to higher semantic information [13], [14]. The latter approach is employed here because, when designing each of the semantic modules leading up to the gaze zone estimator, the intermediate representations of gaze can be used for other studies, such as tactical maneuver prediction [15] and pedestrian intent detection [16]. Key modules in our gaze zone estimation system include face detection using deep convolutional neural networks [17], landmark estimation from cascaded regression models [18], [19], head pose from the relative configuration of 2-D points in the image plane to 3-D points in the head model [20], a horizontal gaze surrogate based on a geometrical formulation of the eyeball and iris position [13], a vertical gaze surrogate based on the openness of the upper eyelids [21] and an appearance descriptor, and finally, 9-class gaze zone estimation from a random forest trained on naturalistic driving data.
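To make the last stage concrete, the sketch below shows only that final module: a random forest mapping per-frame intermediate features to one of the nine gaze zones. The feature dimensionality, the synthetic training data and the function names are illustrative assumptions rather than the authors' implementation, and scikit-learn's RandomForestClassifier stands in for the naturalistic-driving-data-driven random forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# The 9 gaze zones used as class labels (Section III-A).
GAZE_ZONES = ["Left Shoulder", "Left", "Front", "Speedometer", "Rearview",
              "Front Right", "Center Stack", "Right", "Eyes Closed"]

# Hypothetical per-frame intermediate features: head-pose angles plus
# horizontal/vertical gaze surrogates; the 10-dim size is illustrative.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(5000, 10))                  # placeholder features
y_train = rng.integers(0, len(GAZE_ZONES), size=5000)  # placeholder labels

zone_classifier = RandomForestClassifier(n_estimators=100, random_state=0)
zone_classifier.fit(X_train, y_train)

def estimate_gaze_zone(frame_features):
    """Map one frame's intermediate features to one of the 9 gaze zones."""
    return GAZE_ZONES[int(zone_classifier.predict(frame_features[None, :])[0])]
```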

Let the vector G = [g_1, g_2, . . . , g_N] represent the estimated gaze over an arbitrary time period T, where N = fps (frames per second) × T, g_n ∈ Z for n ∈ {1, 2, . . . , N}, and Z represents the set of all gaze zones of interest: Z = {Left Shoulder, Left, Front, Speedometer, Rearview, Front Right, Center Stack, Right, Eyes Closed}. Figure 2 illustrates sample output time segments of the gaze zone estimator: multiple 10-second time segments prior to the start of a lane change, two from left lane changes and two from right lane changes. In the figure, the x-axis represents time and the color displayed at a given time t represents the estimated gaze zone; let SyncF denote the time when the tire touches the lane marking before crossing into the next lane, which is the "0 seconds" displayed in the figure. Note how, prior to the lane change, there is some consistency across the different time segments within a given event type (e.g. left lane change), such as the total gaze fixations and the gaze transitions between gaze zones. In the next section, we define a temporal representation of gaze dynamics using gaze fixations and transition frequencies, which removes some temporal dependencies but still captures sufficient spatio-temporal information to distinguish between different gaze behaviors.

B. Gaze Dynamics Representation

Gaze fixation is a function of the gaze zone. Given a gaze zone, the gaze fixation for that zone is the amount of time the driver spends looking at it within a time period, normalized by the time window to obtain a relative duration. The glance fixation is calculated for each gaze zone z_j, where z_j ∈ Z and j ∈ {1, 2, . . . , M}, as follows:

$$\text{Glance Fixation}(z_j) = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}(g_n = z_j)$$

where 1(·) is an indicator function.
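As a minimal sketch of this computation (the scanpath G here is synthetic; in practice it comes from the gaze zone estimator of Section III-A), the fixation feature reduces to a normalized histogram of zone labels over the window:

```python
import numpy as np

NUM_ZONES = 9          # |Z|, the gaze zones of Section III-A
fps, T = 30, 5         # frame rate and window length in seconds
N = fps * T            # frames per window

rng = np.random.default_rng(0)
G = rng.integers(0, NUM_ZONES, size=N)  # synthetic scanpath of zone indices

def glance_fixation(G, num_zones):
    """Fraction of window frames spent in each gaze zone (sums to 1)."""
    return np.bincount(G, minlength=num_zones) / len(G)

fixations = glance_fixation(G, NUM_ZONES)  # one relative duration per z_j
```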


(a) Two Left Lane Change Events (b) Two Right Lane Change Events

Fig. 2. Illustrates four different scanpaths during a 10-second time window prior to lane change, two scanpaths during left lane change events and two scanpaths during right lane change events, with sample face images from various gaze zones. Consistencies such as total glance duration and number of glances to regions of interest within a time window are useful to note when describing the nature of the driver's gaze behavior. Such consistencies can be used as features to predict behaviors.

Gaze transition frequency is a function of two gaze zones. Given two gaze zones, the gaze transition frequency is the number of times the driver glanced from one gaze zone to the other sequentially within a time period; this value is normalized by the time window to obtain a frequency with respect to time. The matrix F_GT is a transition frequency matrix, so its diagonal entries are by definition 0. In order to remove the order of transition, F_GT is first decomposed into upper and lower triangular matrices. The lower triangular matrix is transposed and summed with the upper triangular matrix, and the result is normalized to produce the new glance transition matrix:

$$F'_{GT}(d, k) = \frac{\text{fps}}{N} \times \begin{cases} 0 & d \le k \\ f_{dk} + f_{kd} & d > k \end{cases}$$

where f_dk is the number of transitions from gaze zone z_d to gaze zone z_k.
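Continuing the sketch above (same synthetic G), a possible implementation counts zone-to-zone switches between consecutive frames, folds the count matrix onto one triangle to discard transition order, and normalizes by the window length in seconds; the fps/N factor follows the reconstruction of the formula above.

```python
def glance_transitions(G, num_zones, fps):
    """Order-free gaze transition frequencies (per second of window)."""
    F = np.zeros((num_zones, num_zones))
    for prev, curr in zip(G[:-1], G[1:]):
        if prev != curr:                 # diagonal of F_GT stays zero
            F[prev, curr] += 1
    # Fold the triangles so F'(d, k) = f_dk + f_kd for d > k, else 0.
    folded = np.tril(F, k=-1) + np.triu(F, k=1).T
    return folded * fps / len(G)         # normalize by window duration
```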

The final feature vector, h, is composed of the gaze fixations computed for every gaze zone of interest and the nonzero triangle of the new glance transition frequency matrix F'_GT in vectorized form, over a time window of gaze dynamics.
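Assembling the two features (continuing the sketches above), the window descriptor could be built as follows; with 9 zones this gives 9 fixation values plus 36 zone-pair frequencies:

```python
def gaze_feature_vector(G, num_zones, fps):
    """Window descriptor h: fixation histogram + transition triangle."""
    fix = glance_fixation(G, num_zones)
    trans = glance_transitions(G, num_zones, fps)
    tri = trans[np.tril_indices(num_zones, k=-1)]  # the nonzero triangle
    return np.concatenate([fix, tri])              # 9 + 36 = 45 dimensions

h = gaze_feature_vector(G, NUM_ZONES, fps)
```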

C. Gaze-based Behavior Modeling

Consider a set of feature vectors H = {h_1, h_2, . . . , h_N} and their corresponding class labels Y = {y_1, y_2, . . . , y_N}. For instance, the class labels can be: Left Lane Change, Right Lane Change, Merge, Secondary Task. Given H and Y, we compute the mean of the feature vectors within each class. Then, we model the gaze behaviors of the respective events, tasks or maneuvers using a multivariate normal distribution (MVN). An unnormalized MVN is trained for each behavior of interest:

$$M_b(h) = \exp\left(-\frac{1}{2}(h - \mu_b)^{T} \Sigma_b^{-1} (h - \mu_b)\right)$$

where b ∈ B = {Left Lane Change, Right Lane Change, Lane Keep}, and μ_b and Σ_b represent the mean and covariance computed from the training feature vectors for the gaze behavior represented by b. One of the reasons for modeling gaze behavior is that, given a new test scanpath h_test, we want to know how it compares to the average scanpath computed for each gaze behavior in the training corpus. One possibility is to take the Euclidean distance between the average scanpath μ_b and the test scanpath h_test for all b ∈ B and assign the label with the shortest distance. However, this assigns equal weight, or penalty, to each component of h. We want the weights to be a function of the component as well as the behavior under consideration. Therefore, we use the Mahalanobis distance, which appears in the exponent of the unnormalized MVN. Exponentiating the negative half of the squared Mahalanobis distance maps the score into the range between 0 and 1. To a degree, this can be used to assess the probability, or confidence, that a certain test scanpath h_test belongs to a particular gaze behavior model.
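A sketch of the per-behavior model and its fitness score follows; the small ridge term added to the covariance is a numerical-stability assumption of this sketch, not part of the paper's formulation.

```python
class GazeBehaviorModel:
    """Unnormalized MVN fit to one behavior's training feature vectors."""
    def __init__(self, H):                   # H: (num_windows, dim) array
        self.mu = H.mean(axis=0)
        cov = np.cov(H, rowvar=False)
        cov += 1e-6 * np.eye(H.shape[1])     # ridge term (assumption)
        self.precision = np.linalg.inv(cov)

    def fitness(self, h):
        """exp(-0.5 * squared Mahalanobis distance), in (0, 1]."""
        d = h - self.mu
        return float(np.exp(-0.5 * d @ self.precision @ d))

# Hypothetical usage: fit one model per behavior, label by best fitness.
# models = {b: GazeBehaviorModel(H_b) for b, H_b in training_features.items()}
# label = max(models, key=lambda b: models[b].fitness(h_test))
```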

IV. ON-ROAD PERFORMANCE EVALUATION

Every instance of driving on a freeway can be categorized exclusively into one of three classes: left lane change, right lane change and lane keep. As a point of synchronization for lane change events, the moment when the vehicle is half in the source lane and half in the destination lane is marked during annotation. A time window of t seconds before this instance defines the left and right lane change events, where t is varied from 10 to 0. For lane keeping events, a lengthy stretch of lane keeping is broken into non-overlapping t-second time windows. Table I contains the number of such events annotated and considered for the following analysis.
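Event extraction under this annotation scheme can be sketched as follows; G_full (the drive-long sequence of per-frame zone indices) and sync_frames (the annotated synchronization frames) are hypothetical names:

```python
def lane_change_windows(G_full, sync_frames, t_sec, fps=30):
    """t-second gaze windows ending at each annotated lane-change point."""
    n = int(t_sec * fps)
    return [G_full[s - n:s] for s in sync_frames if s >= n]

def lane_keep_windows(G_full, start, end, t_sec, fps=30):
    """Non-overlapping t-second windows from a lane-keeping stretch."""
    n = int(t_sec * fps)
    return [G_full[i:i + n] for i in range(start, end - n + 1, n)]
```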

All evaluations in this study are conducted with four-fold cross-validation; four because there are four different drives, as outlined in Table I. Following the popular leave-one-out scheme, training uses the events from three drives, leaving one drive out, and testing uses the events from the remaining drive. With this separation of training and testing samples, we explore the precision and recall of the gaze behavior models in predicting lane changes as a function of time.
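The protocol can be sketched as below, holding out each drive's events in turn; pairing each event's feature vector with its label is assumed to happen per drive:

```python
def leave_one_drive_out(drives):
    """Yield (train_events, test_events) with one full drive held out."""
    for held_out, test_drive in enumerate(drives):
        train = [event for i, drive in enumerate(drives) if i != held_out
                 for event in drive]
        yield train, test_drive

# drives: a list of four lists of (feature_vector, label) pairs, per Table I.
```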


TABLE II
THE RECALL AND PRECISION OF LANE CHANGE PREDICTION (AVERAGED OVER MULTIPLE RUNS IN NATURALISTIC DRIVING SCENARIOS, 88 MIN EACH) VIA GAZE BEHAVIOR MODELING USING A MULTIVARIATE GAUSSIAN.

Time before maneuver     Left Lane Change Prediction    Right Lane Change Prediction
Frames   Milliseconds    Precision    Recall            Precision    Recall
60       2000            0.4130       0.4318            0.5111       0.5000
54       1800            0.4490       0.5000            0.5714       0.6087
48       1600            0.4912       0.6364            0.5625       0.5870
42       1400            0.5484       0.7727            0.5800       0.6304
36       1200            0.5397       0.7727            0.5882       0.6522
30       1000            0.5692       0.8409            0.6111       0.7174
24       800             0.5758       0.8636            0.6316       0.7826
18       600             0.5672       0.8636            0.6500       0.8478
12       400             0.5672       0.8636            0.6500       0.8478
6        200             0.5735       0.8864            0.6500       0.8478
0        0               0.5857       0.9318            0.6500       0.8478

Training occurs on the 5-second time window before SyncF, as defined in Section III-A. During testing, however, we want to probe how early the gaze behavior models are able to predict a lane change. Therefore, starting from 5 seconds before SyncF, sequential samples are extracted at 1/30-second (one frame) intervals up to 5 seconds after SyncF; note that the time window at 5 seconds before SyncF encompasses data from 10 seconds before SyncF up to 5 seconds before SyncF. Each sample is tested for fitness against the three gaze behavior models, namely the models for left lane change, right lane change and lane keep. The sample is assigned the label of the model which procures the highest fitness score, and if this label matches the true label the sample is considered a true positive. Note that each test sample is associated with a time index marking where it was sampled with respect to SyncF. By gathering samples at the same time index with respect to SyncF from the drive not included in the training set, precision and recall values are calculated as a function of the time index.
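The sliding evaluation can be sketched as follows, reusing the feature and model sketches from Section III; the window end advances one frame at a time from 5 seconds before to 5 seconds after SyncF, and each window is labeled by the best-fitting model (the drive is assumed to provide at least 10 seconds of context before SyncF):

```python
def sliding_fitness_labels(G_full, sync, models, t_sec=5, fps=30, span_sec=5):
    """Label 5 s windows whose ends slide from 5 s before to 5 s after SyncF."""
    n = int(t_sec * fps)
    results = []  # (time index relative to SyncF in seconds, predicted label)
    for end in range(sync - span_sec * fps, sync + span_sec * fps + 1):
        h = gaze_feature_vector(G_full[end - n:end], num_zones=9, fps=fps)
        label = max(models, key=lambda b: models[b].fitness(h))
        results.append(((end - sync) / fps, label))
    return results
```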

When calculating precision and recall values, the true labels of samples are remapped from three classes to two classes; for instance, when computing precision and recall for left lane change prediction, all right lane change and lane keep events are considered negative samples and only the left lane change events are considered positive samples. A similar procedure is followed for right lane change prediction. Table II shows the development of the precision and recall values for both left and right lane change prediction in intervals of 200 milliseconds, starting from 2000 milliseconds prior to SyncF up to 0 milliseconds prior to SyncF. Interestingly, there is a plateauing effect in recall for both left and right lane change prediction around 600 milliseconds before the event. One possibility is that at that time index there is a strong indication of intended lane change using gaze analysis alone. Future work experimenting with the time window for modeling and testing will reveal more information.

With respect to the precision rate, the values are not expected to be as high because, even during lane keeping, drivers may exhibit lane-change-like behavior without changing lanes. This is especially observed in the precision rate for left lane change prediction, where checking the left rear-view mirror is a strong indicator of lane change but is also part of the driver's normal mirror-checking behavior during lane keeping. One of the main causes is the gaze model for lane keep, which encompasses a broad spectrum of driving behavior; future work will consider finer and more separated labeling of classes.

One of the motivations behind this work is to estimate the driver's readiness to take over in highly automated vehicles. Situations the system cannot handle, thus requiring a take-over, include system boundaries due to sensor limitations and ambiguous environments. In such situations, looking inside at the state of the driver and estimating how much time is required for the driver to reach take-over readiness is important. Therefore, in this study we developed a framework to estimate the driver's readiness to handle the situation by modeling gaze behavior from non-automated naturalistic driving data; in particular, gaze behavior during lane changes and lane keeping. Figure 3 illustrates the fitness, or confidence, of the models around left and right lane change maneuvers. The figure shows the mean (solid line) and standard deviation (semitransparent shading) of the three models (i.e. left lane change, right lane change, lane keep) for two different maneuvers (i.e. left lane change and right lane change), using the events from the naturalistic driving dataset described in Table I. The model confidence statistics are plotted from 5 seconds before to 5 seconds after the lane change maneuver, where a time of 0 seconds represents when the vehicle is halfway between the lanes. Interestingly, during a left (right) lane change maneuver, the fitness of the right (left) lane change gaze model is very low within the 10-second bracket, and the left (right) lane change model peaks in fitness in a tight time window around the maneuver. Furthermore, for both maneuvers, the lane keep model is, as desired, high in the periphery of the lane change maneuvers, hinting at how this model performs during actual lane keep situations.

(a) Left Lane Change Events (b) Right Lane Change Events

Fig. 3. Illustrates the fitness of the three models (i.e. left lane change, right lane change, lane keep) during left and right lane change maneuvers. Mean (solid line) and standard deviation (semitransparent shades) of the three models as applied to the lane change events described in Table I.

V. CONCLUDING REMARKS

In this study, we explored modeling the driver's gaze behavior in order to predict maneuvers performed by drivers, namely left lane change, right lane change and lane keep. The model developed in this study features three major aspects: first, the spatio-temporal features used to represent the gaze dynamics; second, the definition of the model as the average of the observed instances and the interpretation of why such a model fits the data of interest; and third, the design of the metric for estimating the fitness of the model. Applying this framework to a sequential series of time windows around lane change maneuvers, the gaze models are able to predict left and right lane change maneuvers with an accuracy above 85% around 600 milliseconds before the maneuver.

The overall framework, however, is designed to model the driver's gaze behavior for any task or maneuver performed by the driver. To this end, there are multiple future directions in sight. One is to quantitatively define the relationship between the time window from which to extract the meaningful spatio-temporal features and the task or maneuver performed by the driver. Another is to explore and compare different modeling approaches, including HMMs, LDCRFs and multiple kernel learning. Other future directions include exploring unsupervised clustering of gaze behaviors, especially during lane keeping, and exploring the effects of quantity and quality of events (e.g. same vs. different drives, different drives from the same or different times of day) on gaze behavior modeling.

ACKNOWLEDGMENT

The authors would like to thank their colleagues at the Laboratory for Intelligent and Safe Automobiles (LISA) for their valuable comments, especially Sourabh Vora for his assistance with data analysis. The authors gratefully acknowledge the support of industry partners, especially Fujitsu Ten and Fujitsu Laboratories of America.

REFERENCES

[1] K. Bengler, K. Dietmayer, B. Farber, M. Maurer, C. Stiller, and H. Winner, "Three decades of driver assistance systems: Review and future perspectives," IEEE Intelligent Transportation Systems Magazine, 2014.

[2] E. Ohn-Bar and M. M. Trivedi, "Looking at humans in the age of self-driving and highly automated vehicles," IEEE Transactions on Intelligent Vehicles, 2016.

[3] SAE International, "Taxonomy and definitions for terms related to on-road motor vehicle automated driving systems," 2014.

[4] S. M. Casner, E. L. Hutchins, and D. Norman, "The challenges of partially automated driving," Communications of the ACM, 2016.

[5] M. Korber and K. Bengler, "Potential individual differences regarding automation effects in automated driving," in ACM International Conference on Human Computer Interaction, 2014.

[6] M. Aeberhard, S. Rauch, M. Bahram, G. Tanzmeister, J. Thomas, Y. Pilat, F. Homm, W. Huber, and N. Kaempchen, "Experience, results and lessons learned from automated driving on Germany's highways," IEEE Intelligent Transportation Systems Magazine, 2015.

[7] C. Ahlstrom, K. Kircher, and A. Kircher, "A gaze-based driver distraction warning system and its effect on visual behavior," IEEE Transactions on Intelligent Transportation Systems, 2013.

[8] S. G. Klauer, T. A. Dingus, V. L. Neale, J. D. Sudweeks, D. J. Ramsey et al., "The impact of driver inattention on near-crash/crash risk: An analysis using the 100-car naturalistic driving study data," 2006.

[9] A. Tawari, A. Møgelmose, S. Martin, T. B. Moeslund, and M. M. Trivedi, "Attention estimation by simultaneous analysis of viewer and view," in IEEE International Conference on Intelligent Transportation Systems (ITSC), 2014.

[10] N. Li and C. Busso, "Detecting drivers' mirror-checking actions and its application to maneuver and secondary task recognition," IEEE Transactions on Intelligent Transportation Systems, 2016.

[11] S. A. Birrell and M. Fowkes, "Glance behaviours when using an in-vehicle smart driving aid: A real-world, on-road driving study," Transportation Research Part F: Traffic Psychology and Behaviour, 2014.

[12] S. Vora, A. Rangesh, and M. M. Trivedi, "On generalizing driver gaze zone estimation using convolutional neural networks," in IEEE Intelligent Vehicles Symposium, 2017.

[13] A. Tawari, K. H. Chen, and M. M. Trivedi, "Where is the driver looking: Analysis of head, eye and iris for robust gaze zone estimation," in IEEE International Conference on Intelligent Transportation Systems (ITSC), 2014.

[14] L. Fridman, J. Lee, B. Reimer, and T. Victor, "Owl and lizard: Patterns of head pose and eye pose in driver gaze classification," IET Computer Vision, 2016.

[15] A. Doshi and M. M. Trivedi, "Tactical driver behavior prediction and intent inference: A review," in IEEE International Conference on Intelligent Transportation Systems (ITSC), 2011.

[16] A. Rangesh, E. Ohn-Bar, K. Yuen, and M. M. Trivedi, "Pedestrians and their phones - detecting phone-based activities of pedestrians for autonomous vehicles," in IEEE International Conference on Intelligent Transportation Systems (ITSC), 2016.

[17] K. Yuen, S. Martin, and M. Trivedi, "On looking at faces in an automobile: Issues, algorithms and evaluation on naturalistic driving dataset," in IEEE International Conference on Pattern Recognition, 2016.

[18] X. P. Burgos-Artizzu, P. Perona, and P. Dollar, "Robust face landmark estimation under occlusion," in IEEE International Conference on Computer Vision, 2013.

[19] X. Xiong and F. De la Torre, "Supervised descent method and its applications to face alignment," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

[20] S. Martin, A. Tawari, E. Murphy-Chutorian, S. Y. Cheng, and M. Trivedi, "On the design and evaluation of robust head pose for visual user interfaces: Algorithms, databases, and comparisons," in ACM International Conference on Automotive User Interfaces and Interactive Vehicular Applications, 2012.

[21] S. Martin, E. Ohn-Bar, A. Tawari, and M. M. Trivedi, "Understanding head and hand activities and coordination in naturalistic driving videos," in IEEE Intelligent Vehicles Symposium, 2014.

IEEE Intelligent Vehicles Symposium, June 2017