
Copyright © 2010 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions Dept, ACM Inc., fax +1 (212) 869-0481 or e-mail [email protected]. ETRA 2010, Austin, TX, March 22 – 24, 2010. © 2010 ACM 978-1-60558-994-7/10/0003 $10.00

Group-Wise Similarity and Classification of Aggregate Scanpaths

Thomas Grindinger∗, Andrew T. Duchowski
School of Computing, Clemson University

Michael Sawyer
Industrial Engineering, Clemson University

Figure 1: Typical scanpath visualization at left. Time-projected scanpath visualization at right, where the y-axis denotes vertical gaze position and the x-axis denotes time. Fixation labels are common between the two. Vertical markers denote one-second intervals.

Abstract

We present a novel method for the measurement of the similarity between aggregates of scanpaths. This may be thought of as a solution to the “average scanpath” problem. As a by-product of this method, we derive a classifier for groups of scanpaths drawn from various classes. This capability is empirically demonstrated using data gathered from an experiment in an attempt to automatically determine expert/novice classification for a set of visual tasks.

CR Categories: J.4 [Computer Applications]: Social and Behavioral Sciences—Psychology. I.2 [Pattern Recognition]: Models—Statistical.

Keywords: eye tracking, scanpath comparison, classification

1 Introduction

Scanpath comparison is a topic of growing interest. Methods have been proposed that allow comparison of two scanpaths. Less work has been done on the comparison of one or more scanpaths to different groups of scanpaths. This is useful work, especially in light of its potential in training environments; e.g., Sadasivan et al. [2005] demonstrated that expert scanpaths could be used as feedforward information to guide novices in visual search. Our work provides a means of evaluating the ways in which different portions of a novice’s scanpath deviate from the expert’s.

Leigh and Zee [1991] discuss the implications of eye movements on the diagnosis and understanding of certain neurological disorders. Our work also has potential in marketing and film-viewing analysis.

∗e-mail: [email protected]

For instance, a director could specify a “basis” scanpath, which the audience is expected to closely approximate. Our metric could measure how closely the audience conforms to that basis. The ability to perform aggregate scanpath similarity measurement and classification for each of these tasks is clearly needed.

2 Background

One of the foundational works in scanpath comparison was Privitera and Stark [2000]’s use of a string-editing procedure to compare the sequential loci of scanpaths. Their approach does not distinguish between fixations of different durations. Our approach is similar, but operates at a finer level of granularity, allowing for comparison with groups of scanpaths. Hembrooke et al. [2006] used a multiple sequence alignment algorithm to create an average scanpath for multiple viewers, providing the functionality lacking in the previous work. Unfortunately, their procedure was never explained in detail, and no objective results were provided.

Duchowski and McCormick [1998] described a visualization which tracks fixations through time, referred to as “volumes of interest”. They were able to visualize multiple scanpaths in two and three dimensions using this temporal mapping. The former plots x or y components on the y-axis and time on the x-axis, while the latter visualizes fixations as three-dimensional, uniform-width volumes, where time serves as the third dimension. Räihä et al. [2005] described a similar visualization in two dimensions, with the slight difference that fixations were displayed as variable-size circles, congruent with the typical visualization of scanpaths. Heatmap visualizations, as described by Pomplun et al. [1996] and popularized by Wooding [2002], overlay attentional information onto a stimulus as colors, where hot colors correspond to regions of high interest and cold (or no) colors correspond to regions of low interest. This representation is highly informative, yet does not provide any quantitative information. We utilize the concept of heatmaps in our algorithm, but we do not aggregate them over time.

Our similarity measure somewhat resembles the Earth Mover’s Distance used by Dempere-Marco et al. [2006] when considering cognitive processes underlying visual search of medical images. The approach is also similar to Galgani et al.’s [2009] effort to diagnose ADHD through eye tracking data.


Figure 2: Collections of scanpaths of novice (a) and expert (c) pilots over a single stimulus. Time-projected scanpaths of novices (b) and of experts (d) can be considered side views of the three-dimensional data.

They created three classifiers, including one based on Levenshtein distance, and discovered that the Levenshtein-based classifier gave the best results among their chosen algorithms. To show relative improvement, we also compare the performance of our algorithm to a similar Levenshtein classifier.

3 Group-Wise Similarity

Our algorithm takes as input two collections of fixation-filtered scanpaths. An example image is presented in Figure 2, displayed in 2(a) with all novice scanpaths and in 2(c) with all expert scanpaths. From a simple visual examination, there is no obvious characteristic that stands out for either collection. A procedure is then needed to perform a deeper statistical analysis of each collection.

The original impetus for this approach was the desire to formulate an elegant scanpath comparison measure for dynamic stimuli, such as movies or interactive tasks. Current string-editing approaches are not sufficient for video. For example, a string-editing alignment could mistakenly align AOIs from frames that are many seconds apart. There is nothing to explicitly constrain AOIs to coincide only within specific temporal limits.

From the perspective of a collection of movie frames, each frame can be thought of as a separate stimulus. The scanpath for a single subject viewing a movie stimulus can then be broken up into a collection of fixation-frame units, which are more or less independent from each other. This conceptualization of a scanpath differs from the conventional view, in that the conventional visualization is a “projection” of fixations over time onto a two-dimensional plane. Our conceptualization avoids this projection entirely. Thus, we produce a three-dimensional “scanpath function”.

Given some scanpath s and time t, the fixation function f(s, t) produces either the fixation attributable at that timestamp (i.e., frame) from scanpath s, or null (for saccades). Figure 1 visualizes the difference between the standard scanpath representation and a side view of the three-dimensional representation.

We extend the above definition to the function f(S, t) by changing the single scanpath parameter s to a collection of scanpaths S. This function would then return a collection of fixations for all scanpaths in S at the given timestamp. Then, we may differentiate groups of subjects into their own scanpath sets. For instance, in our experiment, we study the differences between experts and novices. We may then create an expert scanpath set E and a novice scanpath set N. The functions f(E, t) and f(N, t) would then return collections of fixations at timestamp t for experts and novices, respectively (Figures 2(b) and 2(d) visualize the same data as in Figures 2(a) and 2(c), but as side views of their three-dimensional representations).
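A minimal sketch of these two functions, assuming fixations are stored as (x, y, start_ms, end_ms) tuples; the paper does not specify a data layout, so the representation and helper names below are hypothetical:

    def fixation_at(scanpath, t_ms):
        """f(s, t): the fixation active at time t_ms, or None during a saccade."""
        for x, y, start, end in scanpath:
            if start <= t_ms < end:
                return (x, y)
        return None

    def group_fixations_at(scanpaths, t_ms):
        """f(S, t): all fixations active at time t_ms across a group of scanpaths."""
        hits = (fixation_at(s, t_ms) for s in scanpaths)
        return [f for f in hits if f is not None]

For example, f(E, 1000) becomes group_fixations_at(E, 1000), returning the expert fixations one second into the stimulus.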

These group-specific collections of fixations for single frames may be clustered by the mean-shift approach described by Santella and DeCarlo [2004]. The resulting clusters serve as general AOIs for a given frame, describing regions of varying interest for that specific group of individuals. We may then construct a probabilistic model of expected attention for that group. Such a model for a single frame is visualized in Figure 3.
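As an illustration, scikit-learn’s MeanShift can stand in for the clustering step; the library choice and the bandwidth value are assumptions, not details taken from the paper or from Santella and DeCarlo’s implementation:

    import numpy as np
    from sklearn.cluster import MeanShift

    def frame_cluster_centers(fixations, bandwidth=50.0):
        """Cluster one frame's group fixations into AOI centers via mean shift."""
        if not fixations:
            return np.empty((0, 2))
        ms = MeanShift(bandwidth=bandwidth)  # bandwidth is an assumed value
        ms.fit(np.asarray(fixations, dtype=float))
        return ms.cluster_centers_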

Each frame will have a separate model associated with it, and we may calculate the “error per group” of a given fixation in a frame by calculating the summation of the Gaussian distances from the fixation point to all group-specific cluster centers.


Figure 3: Mixture of Gaussians for expert fixations at a discrete timestamp. Displayed novice fixations were not used in the clustering operation. Note that the fixation labeled ‘A’ is far from the cluster centers, and thus has lower similarity than the fixation labeled ‘B’, which is close to a cluster center.

We use a Gaussian kernel with a standard deviation of 50 pixels to determine the distance value, which we then invert. Thus, a fixation point collocated with a cluster mean or centroid has an inverse distance value (similarity) of 1.0, and a fixation point more than 50 pixels away from the cluster mean has an inverse distance value close to 0. The sum of the cluster similarities for a single fixation point is divided by the number of clusters, giving a value between 0 and 1.
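Read literally, this per-fixation similarity is a cluster-averaged Gaussian kernel. A sketch under that reading, with the exact kernel normalization being our assumption:

    import numpy as np

    def fixation_similarity(fix_xy, centers, sigma=50.0):
        """Mean Gaussian similarity of one fixation to a frame's cluster centers."""
        if len(centers) == 0:
            return 0.0
        # squared distance from the fixation to each cluster center
        d2 = np.sum((np.asarray(centers) - np.asarray(fix_xy)) ** 2, axis=1)
        # Gaussian kernel: 1.0 when collocated, decaying toward 0 with distance
        return float(np.mean(np.exp(-d2 / (2.0 * sigma ** 2))))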

With a mechanism to evaluate the group-specific error, or rather similarity, of fixation points in individual frames, we may then extrapolate this process over the entire scanpath duration by summing individual similarities for each frame and then returning the average. Thus, a scanpath in which most fixation points lie near group-specific clusters will have similarity close to 1.0 for that group, while a scanpath in which most fixation points lie far away from those clusters will have similarity close to 0. This metric may then be extrapolated further to describe the similarity of one group of scanpaths to another by simply averaging together the group-wise similarities of each scanpath in one group to the entire other group.
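Combining the helpers sketched above, the two levels of averaging might look as follows; whether saccade samples (null fixations) are skipped or counted as zero is not stated in the paper, so skipping them here is an assumption:

    def scanpath_to_group_similarity(scanpath, group, times_ms, sigma=50.0):
        """Mean per-frame similarity of one scanpath to a group's AOI models."""
        scores = []
        for t in times_ms:
            fix = fixation_at(scanpath, t)
            if fix is None:          # saccade sample: skipped (an assumption)
                continue
            centers = frame_cluster_centers(group_fixations_at(group, t))
            scores.append(fixation_similarity(fix, centers, sigma))
        return sum(scores) / len(scores) if scores else 0.0

    def group_to_group_similarity(group_a, group_b, times_ms):
        """Average similarity of each scanpath in group A to all of group B."""
        sims = [scanpath_to_group_similarity(s, group_b, times_ms) for s in group_a]
        return sum(sims) / len(sims)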

The data collected for expert/novice classification purposes did not, in fact, use video as the stimulus. Nevertheless, while the video-based approach is expected to be more reliable for video, its application to static images would also be beneficial. In concordance with the video paradigm, we take samples from our data every 16 milliseconds. Thus, this procedure may be utilized for analysis over both static and dynamic stimuli. In our study, recorded scanpaths are of various lengths. We must, therefore, specify a time window over which to collect fixation data. The upper bound on the length of this window is the shorter of either the length of the scanpath being compared or the mean of the scanpath lengths for a given stimulus.
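A small helper for generating those sample timestamps, following the windowing rule just described (the function name is ours):

    def sample_times(scanpath_len_ms, mean_len_ms, step_ms=16):
        """Timestamps every 16 ms, capped by the shorter of the two lengths."""
        window = min(scanpath_len_ms, mean_len_ms)
        return range(0, int(window), step_ms)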

To evaluate the capabilities of this new approach, we compared the results to a group-wise extension of pairwise string-editing similarity. The group-wise string-editing similarity of a single scanpath to a group of scanpaths is the average pairwise similarity of that scanpath to each scanpath in the group it is being compared to.
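A sketch of that baseline, assuming scanpaths have already been quantized into AOI-label strings; the quantization step is not shown, and normalizing edit distance into a [0, 1] similarity is our assumption:

    def levenshtein(a, b):
        """Classic edit distance via dynamic programming."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            prev = cur
        return prev[-1]

    def groupwise_string_similarity(scanpath_str, group_strs):
        """Mean normalized pairwise similarity of one string to a group."""
        def sim(a, b):
            return 1.0 - levenshtein(a, b) / (max(len(a), len(b)) or 1)
        return sum(sim(scanpath_str, g) for g in group_strs) / len(group_strs)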

4 Classification

Our method of group-wise scanpath similarity is validated via a machine-learning approach. Machine learning, specifically classification, is a statistical framework which takes, as input, one or more groups of data and produces, as output, probability values that describe the likelihood that some arbitrary datum is a member of one or more of the defined groups.

Thus, we may use this approach to validate whether our group-wise similarity measure produces information that may be used to reliably discriminate between groups. A classifier must be constructed for each group, e.g., the expert group and the novice group. The classifier for expert data will be described below. The classifier for novice data may be constructed identically, though with different input values.

As input to our classifier, we provide a list of group-wise similarity scores, corresponding to the similarities of individual scanpaths to the expert model, as described above. The goal of the expert classifier, then, is to determine some similarity threshold score, above which a given scanpath is likely to be expert and below which it is unlikely to be expert.

We use the receiver operating characteristic (ROC) curve to find this threshold. A thorough description of the curve may be found in Fogarty et al. [2005]. This curve may also be used to compute the area under the ROC curve (AUC). This value describes the discriminative ability of a classifier. Simple percentage accuracy values may be misrepresentative, especially in skewed cases, such as having a large quantity of data from one class and a small quantity of data from another. The AUC value describes the probability that an individual instance of one class will be classified differently from an instance of another class.
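Using scikit-learn as an illustration, threshold selection and AUC computation might look like this; the paper does not say which point on the ROC curve it picks, so maximizing Youden’s J statistic (tpr − fpr) is an assumption:

    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    def fit_threshold(labels, scores):
        """labels: 1 for in-class instances (e.g., expert), 0 otherwise."""
        fpr, tpr, thresholds = roc_curve(labels, scores)
        best = np.argmax(tpr - fpr)      # Youden's J: an assumed criterion
        return thresholds[best], roc_auc_score(labels, scores)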

Two classifiers are trained: an expert and a novice classifier. This means that two scores are produced for a single instance. Each score describes the probability that an instance is a member of the expert or novice group, respectively. In order to decide to which class this instance conclusively belongs, we use a heuristic. There are a few possibilities for the arrangement of these scores. First, the expert score may be higher than the expert threshold, and the novice score may be lower than the novice threshold. This case is trivially expert. Similarly, an instance with an expert score lower than the expert threshold and a novice score higher than the novice threshold is trivially novice. In the case of both scores being above or below their respective thresholds, we divide the score of each classifier by its threshold value and choose the greater of the two.
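The heuristic, written out directly; ties and the boundary case of a score exactly at its threshold are resolved here by convention, since the paper does not specify them:

    def classify(expert_score, novice_score, expert_thresh, novice_thresh):
        """Resolve two per-class scores into a single class label."""
        above_expert = expert_score > expert_thresh
        above_novice = novice_score > novice_thresh
        if above_expert and not above_novice:
            return "expert"                 # trivially expert
        if above_novice and not above_expert:
            return "novice"                 # trivially novice
        # both above or both below: compare threshold-normalized scores
        if expert_score / expert_thresh >= novice_score / novice_thresh:
            return "expert"
        return "novice"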

5 Results

In order to evaluate our method, we analyzed the results of a study wherein 20 high-time pilots (experts) and 20 non-pilots (novices) were presented with 20 different images of weather. Subjects were asked to determine whether they would continue their current flight path or whether they needed to divert. Their eye movements were recorded by a Tobii ET-1750 eye tracker (their verbal responses were ignored in our analysis). Our objective was to produce a classifier that can predict whether a subject is expert or novice, based solely on their eye movements.

With two classes, a random classifier would be expected to produce accuracy and AUC values of 0.50. Evaluation metrics for our mechanism are listed in Table 1. In our evaluation, we refer to expert data as our positive class and novice data as negative. According to the p-values, all metrics are significantly higher than random for our method, while for the string-editing method only the negative (novice) accuracy and the AUC values differ significantly from random.

These results show the classifier’s discriminative ability over a single stimulus. Given multiple stimuli, our measure is extrapolated over all stimuli for each subject. A “majority vote” is then used, where one vote is drawn from each stimulus. If more than half the votes indicate that a subject is expert, that subject is classified as conclusively expert. Otherwise, the subject is classified as novice. Accuracies for this voting mechanism are listed in Table 2.
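The voting rule, written out (the label strings are ours):

    def classify_subject(per_stimulus_labels):
        """Majority vote over per-stimulus classifications for one subject."""
        votes = sum(1 for lbl in per_stimulus_labels if lbl == "expert")
        return "expert" if votes > len(per_stimulus_labels) / 2 else "novice"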


Cross-Validation Results

                    posAcc   negAcc   totAcc   posAUC   negAUC
  Temporal
    Average           0.71     0.64     0.68     0.85     0.86
    Std Dev           0.07     0.12     0.07     0.07     0.04
    Median            0.74     0.66     0.68     0.87     0.86
    p-value           0.00     0.01     0.00     0.00     0.00
  String-editing
    Average           0.49     0.64     0.57     0.81     0.72
    Std Dev           0.17     0.13     0.06     0.06     0.11
    Median            0.48     0.64     0.57     0.82     0.71
    p-value           0.98     0.02     0.16     0.00     0.00

Table 1: Results of classification cross-validation for both the new temporal method and string-editing similarity. Columns are accuracy of positive (expert) and negative (novice) instances, total combined accuracy, and AUC values for positive and negative classification. P-values are results of t-tests for significance of score distributions against a random distribution.

Subject Results

                   Temporal            String-editing
               Experts   Novices    Experts   Novices
    Average      0.68      0.35       0.45      0.34
    Accuracy      85%       95%        40%       80%

Table 2: Results of cross-stimulus validation. Accuracy is determined by counting the number of experts/novices with an expert ratio greater than 0.5 in the case of experts and less than or equal to 0.5 in the case of novices.

6 Discussion

The AUC values listed in Table 1 show stronger discriminative ability than a measure based on string-editing. P-values from t-tests indicate that the results of our new method are significantly different from random for all measures, while results of the string-editing method are only significant for novices and AUC values. The cross-stimulus results in Table 2 show that novice instances are consistently easier to classify than expert instances, but the overall accuracies are still quite high: 85% of the positive instances are properly classified as experts, while 95% of the negative instances are classified as novice. This is an improvement over string-editing, with 40% positive accuracy and 80% negative accuracy.

The average accuracies in the cross-stimulus table may be interpreted as the cross-validated similarity of each class to the expert class. The group-wise similarity of experts to the expert class is 0.68, while the group-wise similarity of novices to the expert class is 0.35. The experts’ similarity is above 0.5, while the novices’ is below 0.5, which is appropriate and intuitive, though one might expect the similarity of a class with itself to be closer to 1.0. In this case, though, since we are cross-validating our results, we are not so much measuring the similarity between a group and itself as measuring the average similarity between members of the same class. In the case of measuring the similarity of different classes, though, such as comparing the novice class to the expert class, the intuitive idea of group-wise similarity is more appropriate and convenient.

7 Conclusion

A group-wise scanpath similarity measure and classification algorithm have been described, allowing analysis and discrimination of groups of scanpaths based on any informative grouping of those scanpaths. This mechanism has been empirically and statistically validated, showing that it is capable of discriminating between groupings at least as diverse as expert/novice subject appellation, with greater accuracy and reliability than random. Potential applications include training environments, neurological disorder diagnosis, and, in general, evaluation of attention deviation from that expected or desired during a dynamic stimulus. Future work may include pre-alignment of unclassified scanpaths with classified scanpaths, attempting to increase the accuracy further during the calculation of class similarity.

References

DEMPERE-MARCO, L., HU, X.-P., ELLIS, S. M., HANSELL, D. M., AND YANG, G.-Z. 2006. Analysis of Visual Search Patterns With EMD Metric in Normalized Anatomical Space. IEEE Transactions on Medical Imaging 25, 8 (August), 1011–1021.

DUCHOWSKI, A. T. AND MCCORMICK, B. H. 1998. Gaze-Contingent Video Resolution Degradation. In Human Vision and Electronic Imaging III. SPIE, Bellingham, WA.

FOGARTY, J., BAKER, R. S., AND HUDSON, S. E. 2005. Case studies in the use of ROC curve analysis for sensor-based estimates in human computer interaction. In GI ’05: Proceedings of Graphics Interface 2005. Canadian Human-Computer Communications Society, School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada, 129–136.

GALGANI, F., SUN, Y., LANZI, P., AND LEIGH, J. 2009. Automatic analysis of eye tracking data for medical diagnosis. In Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining (IEEE CIDM 2009). IEEE.

HEMBROOKE, H., FEUSNER, M., AND GAY, G. 2006. Averaging Scan Patterns and What They Can Tell Us. In Eye Tracking Research & Applications (ETRA) Symposium. ACM, San Diego, CA, 41.

LEIGH, R. J. AND ZEE, D. S. 1991. The Neurology of Eye Movements, 2nd ed. Contemporary Neurology Series. F. A. Davis Company, Philadelphia, PA.

POMPLUN, M., RITTER, H., AND VELICHKOVSKY, B. 1996. Disambiguating Complex Visual Information: Towards Communication of Personal Views of a Scene. Perception 25, 8, 931–948.

PRIVITERA, C. M. AND STARK, L. W. 2000. Algorithms for Defining Visual Regions-of-Interest: Comparison with Eye Fixations. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 22, 9, 970–982.

RÄIHÄ, K.-J., AULA, A., MAJARANTA, P., RANTALA, H., AND KOIVUNEN, K. 2005. Static Visualization of Temporal Eye-Tracking Data. In INTERACT. IFIP, 946–949.

SADASIVAN, S., GREENSTEIN, J. S., GRAMOPADHYE, A. K., AND DUCHOWSKI, A. T. 2005. Use of Eye Movements as Feedforward Training for a Synthetic Aircraft Inspection Task. In Proceedings of ACM CHI 2005 Conference on Human Factors in Computing Systems. ACM Press, Portland, OR, 141–149.

SANTELLA, A. AND DECARLO, D. 2004. Robust Clustering of Eye Movement Recordings for Quantification of Visual Interest. In Eye Tracking Research & Applications (ETRA) Symposium. ACM, San Antonio, TX, 27–34.

WOODING, D. 2002. Fixation Maps: Quantifying Eye-Movement Traces. In Eye Tracking Research & Applications (ETRA) Symposium. ACM, New Orleans, LA.
