
Copyright © 2010 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions Dept, ACM Inc., fax +1 (212) 869-0481 or e-mail [email protected]. ETRA 2010, Austin, TX, March 22 – 24, 2010. © 2010 ACM 978-1-60558-994-7/10/0003 $10.00

Match-Moving for Area-Based Analysis of Eye Movements in Natural Tasks

Wayne J. Ryan, School of Computing
Andrew T. Duchowski*, School of Computing
Ellen A. Vincent, Department of Horticulture
Dina Battisto, Department of Architecture
Clemson University

(a) Headgear. (b) Complete device. (c) User interface.

Figure 1: Our do-it-yourself wearable eye tracker (a) from off-the-shelf components (b) and the graphical user interface (c) featuring VCR controls for frame advancement and match-moving search boxes for dynamic object tracking.

Abstract

Analysis of recordings made by a wearable eye tracker is complicated by video stream synchronization, pupil coordinate mapping, eye movement analysis, and tracking of dynamic Areas Of Interest (AOIs) within the scene. In this paper a semi-automatic system is developed to help automate these processes. Synchronization is accomplished via side-by-side video playback control. A deformable eye template and calibration dot marker allow reliable initialization via simple drag and drop as well as a user-friendly way to correct the algorithm when it fails. Specifically, drift may be corrected by nudging the detected pupil center to the appropriate coordinates. In a case study, the impact of surrogate nature views on physiological health and perceived well-being is examined via analysis of gaze over images of nature. A match-moving methodology was developed to track AOIs for this particular application but is applicable toward similar future studies.

CR Categories: I.3.6 [Computer Graphics]: Methodology and Techniques—Ergonomics; J.4 [Computer Applications]: Social and Behavioral Sciences—Psychology.

Keywords: eye tracking, match moving

1 Introduction

Buswell's [1935] seminal exploration of eye gaze over complex scenes, e.g., photographs of paintings, patterns, architecture, and interior design, helped influence development of techniques for recording and analysis of human eye movements during performance of natural tasks. Wearable eye trackers allow collection of eye movements in natural situations, usually involving the use of generally unconstrained eye, head, and hand movements.

*e-mail: {wryan | andrewd}@cs.clemson.edu

The most common eye tracking metrics sought include the number of fixations, fixation durations, and number and duration of fixations per Area Of Interest, or AOI, among several others [Megaw and Richardson 1979; Jacob and Karn 2003]. Longer fixations generally indicate greater cognitive processing of the fixated area, and the number of fixations and percentage of fixation time devoted to a particular area may indicate its saliency [Webb and Renshaw 2008].

Complications in analysis arise in synchronization of the video streams, mapping of eye position in the eye image frame to the point of gaze in the scene image frame, distinction of fixations from saccades within the raw gaze point data stream, and determination of the frame-to-frame location of dynamic AOIs within the scene video stream. Most previous work relied on manual video frame alignment as well as manual (frame-by-frame) classification of eye movements. In this paper, a semi-automatic system is developed to help automate these processes, inspired by established computer graphics methods primarily employed in video compositing. The system consists of video synchronization, calibration dot and limbus tracking (as means of estimation of parameters for gaze point mapping and of the pupil center, respectively), fixation detection via a signal analysis approach independent of video frame rate, and AOI tracking for eventual statistical analysis of fixations within AOIs. Tracking of the calibration dot and of AOIs is achieved by implementation of a simple 2D variant of match-moving [Paolini 2006], a technique used for tracking markers in film, primarily to facilitate compositing of special effects (e.g., texture mapping computer-generated elements atop principal photography). The result is a semi-automatic approach akin to keyframing to set the location of markers over scene elements in specific video frames and inbetweening their location coordinates (usually) by linear interpolation.

A case study is presented where eye movements are analyzed over images viewed in a hospital setting. The analysis is part of an experiment conducted to better understand the potential health benefits of images of nature toward patient recovery. Although descriptive statistics of gaze locations over AOIs are underwhelming, the given methodology is applicable toward similar future studies.


2 Background

Classic work analyzing eye movements during performance of a well-learned task in a natural setting (making tea) aimed to determine the pattern of fixations and to classify the types of monitoring action that the eyes perform [Land et al. 1999]. A head-mounted eye-movement video camera was used that provided a continuous view of the scene ahead, with a dot indicating foveal direction with an accuracy of about 1°. Results indicated that even automated routine activities require a surprising level of continuous monitoring. Land et al. concluded that although the actions of tea-making are 'automated' and proceed with little conscious involvement, the eyes closely monitor every step of the process. This type of unconscious attention must be a common phenomenon in everyday life.

Relations of eye and hand movements in extended food preparation tasks have also been examined [Land and Hayhoe 2001]. The task of tea-making was compared to the task of making peanut butter and jelly sandwiches. In both cases the location of foveal gaze was monitored continuously using a head-mounted eye tracker with an accuracy of about 1°, with the head free to move. The eyes usually reached the next object in the task sequence before any sign of manipulative action, indicating that eye movements are planned into the motor pattern and lead each action. Eye movements during this kind of task are nearly all to task-relevant objects, and thus their control is seen primarily 'top-down', and influenced very little by the 'intrinsic salience' of objects.

Examination of short-term memory in the course of a natural hand-eye task [Ballard et al. 1995] showed employment of deictic primitives through serialization of the task with eye movements (e.g., using the eyes to "point to" scene objects in lieu of memorizing all of the objects' positions and other properties). A head-mounted eye tracker was used to measure eye movements over a three-dimensional physical workplace block display. By recording eye movements during a block pick-and-place task, it was shown that subjects frequently directed gaze to the model pattern before arranging blocks in the workspace area. This suggests that information is acquired incrementally during the task and is not acquired in toto at the beginning of the task. That is, subjects appeared to use short-term memory frugally, acquiring information just prior to its use, and did not appear to memorize the entire model block configuration before making a copy of the block arrangement.

In a similar block-moving experiment, horizontal movements of gaze, head, and hand were shown to follow a coordinated pattern [Smeets et al. 1996]. A shift of gaze was generally followed by a movement of the head, which preceded the movement of the hand. This relationship is to a large extent task-dependent. In goal-directed tasks in which future points of interest are highly predictable, while gaze and head movements may decouple, the actual position of the hand is a likely candidate for the next gaze shift.

Up to this point, while significant in its contributions to vision research, the analysis employed in the above examples was often based on manual inspection of video frames.

Relatively recently, intentionally-based eye movements, termed "look-ahead" movements, were reported [Pelz et al. 2000]. A commercially available wearable eye tracker from ASL was worn on the head with a computer carried in a backpack. Subsequently, a custom-built wearable eye tracker was assembled with off-the-shelf components, initiating an open-source movement to develop practical eye tracking software and hardware [Babcock and Pelz 2004]. Tips were provided on constructing the tracker, opening the door to open-source software development. This corneal reflection eye tracker, mainly constructed from inexpensive components (a Radio Shack parts list was made available at one time), was one of the first Do-It-Yourself eye trackers, but suffered from two significant problems. First, it required the inclusion of one expensive video component, namely a video multiplexer, used to synchronize the video feeds of the scene and eye cameras. Second, the system relied on somewhat dated (by today's standards) video recording equipment (a Sony DCR-TRV19 DVR). Nevertheless, the system served its purpose in fostering a nascent open-source approach to (wearable) eye tracking that is still influential today.

Recent work from the same research lab advanced the analytical capability of the wearable eye tracker in two important ways [Munn et al. 2008]. First, fixation detection was used to analyze raw eye movement data. Second, a method was presented for tracking objects in the scene. Both developments are somewhat similar to what is presented in this paper, but with significant distinctions. First, the prior method of fixation detection is tied to the eye video frame rate. In this paper we show that eye movement analysis is independent of frame rate, insofar as it operates on the eye movement data stream (x, y, t) where the timestamp of each gaze coordinate is encoded in the data tuple. This form of analysis is not new; in fact, the analysis code was originally developed commercially (by LC Technologies), made publicly available, and successfully used in at least one instance [Freed 2003]. Second, the prior technique for scene object tracking uses structure from motion to compute 3D information [Munn and Pelz 2008]. We show that such complex computation is unnecessary and a simple 2D translational tracker is sufficient. This form of tracking, known as match-moving, is also well established in the practice of video compositing.

3 Technical Development

Hardware design follows the description of Li [2006], with minimal modifications. The apparatus is constructed entirely from inexpensive commercial off-the-shelf (COTS) components (see Figures 1 and 2). The entire parts list for the device includes one pair of safety glasses (AOSafety X-Factor XF503), a more comfortable nose piece of a second pair of plastic sunglasses (AOSafety I-Riot 90714), black polyester braided elastic for wrapping the wires, two screws to connect the scene camera bracket and nose piece, a small aluminum or brass rod for mounting the eye camera, and two digital video minicams.

The two digital video mini-camcorders used are the Camwear Model 200 from DejaView [Reich et al. 2004]. Each DejaView wearable digital mini-camcorder uses the NW901 MPEG-4 CODEC from Divio, Inc., enabling MPEG-4 video recording at 30 fps. Each DejaView camera's field of view subtends 60°.

The DejaView camera is connected via flexible cable to the recorder box, which comes with a belt clip for hands-free use, but lacks an LCD display. Video is recorded on a 512 MB SD mini disk. After recording, the video may be transferred to a computer for offline processing. The DejaView mini-camcorders do not support transfer of video while recording, precluding online processing. The lack of an LCD display also prevents verification of correct camera positioning until after the recording is complete.

Figure 2: Eye tracker assembly [Ryan et al. 2008] © ACM 2008.


Figure 3: Screen flash for synchronization visible as eye reflection.

No IR illumination is used, simplifying the hardware and reducing cost. The eye tracker functions in environments with significant ambient IR illumination (e.g., outdoors on a sunny day; see Ryan et al. [2008]). However, lacking a stable corneal reflection and visible spectrum filtering, video processing is more challenging. Specular reflections often occlude the limbus and contrast at the pupil boundary is inconsistent.

3.1 Stimulus for Video Processing

For video synchronization and calibration, a laptop computer is placed in front of the participant. To synchronize the two videos, a simple program that flashes the display several times is executed. Next, a roving dot is displayed for calibration purposes. The participant is asked to visually track the dot as it moves. The laptop display is then flashed again to signify the end of calibration. For good calibration the laptop display should appear entirely within the scene image frame, and should span most of the frame. After calibration the laptop is moved away and the participant is free to view the scene normally. After a period of time (in this instance about two minutes) the recording is stopped and video collection is complete. All subsequent processing is then carried out offline. Note that during recording it is impossible to judge camera alignment. Poor camera alignment is the single greatest impediment toward successful data processing.

3.2 Synchronization

Video processing begins with synchronization. Synchronization is necessary because the two cameras might not begin recording at precisely the same time. This situation would be alleviated if the cameras could be synchronized via hardware or software control (e.g., via IEEE 1394 bus control). In the present case, no such mechanism was available. As suggested previously [Li and Parkhurst 2006], a flash of light visible in both videos is used as a marker. Using the marker, an offset necessary for proper frame alignment is established. In order to find these marker locations in the two video streams, they are both displayed side by side, each with its own playback control. The playback speed is adjustable in forward and reverse directions. Single frame advance is also possible. To synchronize the videos, the playback controls are used to manually advance/rewind each video to the last frame where the light flash is visible (see Figure 3).

3.3 Calibration & Gaze Point Mapping

Pupil center coordinates are produced by a search algorithm executed over eye video frames. The goal is to map the pupil center to gaze coordinates in the corresponding scene video frame. Calibration requires sequential viewing of a set of spatially distributed calibration points with known scene coordinates. Once calibration is complete the eye is tracked and gaze coordinates are computed for the remainder of the video. A traditional video-oculography approach [Pelz et al. 2000; Li et al. 2006] calculates the point of gaze by mapping the pupil center (x, y) to scene coordinates (s_x, s_y) via a second-order polynomial [Morimoto and Mimica 2005],

Figure 4: Initialization of pupil/limbus and dot tracking.

s_x = a_0 + a_1 x + a_2 y + a_3 xy + a_4 x^2 + a_5 y^2
s_y = b_0 + b_1 x + b_2 y + b_3 xy + b_4 x^2 + b_5 y^2.    (1)

The unknown parameters a_k and b_k are computed via least squares fitting (e.g., see Lancaster and Šalkauskas [1986]).
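To make the fit concrete, the following is a minimal sketch of estimating the coefficients of Equation (1) by linear least squares, assuming matched pupil centers and scene-frame dot positions collected during the calibration sequence; the function and variable names are illustrative, not taken from the authors' code.

```python
# Sketch of the second-order polynomial gaze mapping of Equation (1),
# fit by linear least squares. Names are illustrative.
import numpy as np

def fit_gaze_mapping(pupil_xy, scene_xy):
    """pupil_xy, scene_xy: (N, 2) arrays of matched calibration samples."""
    x, y = pupil_xy[:, 0], pupil_xy[:, 1]
    # Design matrix: [1, x, y, xy, x^2, y^2] per sample.
    A = np.column_stack([np.ones_like(x), x, y, x * y, x**2, y**2])
    a, *_ = np.linalg.lstsq(A, scene_xy[:, 0], rcond=None)   # a0..a5
    b, *_ = np.linalg.lstsq(A, scene_xy[:, 1], rcond=None)   # b0..b5
    return a, b

def map_gaze(a, b, x, y):
    """Map a pupil center (x, y) to scene coordinates (sx, sy)."""
    feats = np.array([1.0, x, y, x * y, x**2, y**2])
    return feats @ a, feats @ b
```

With six unknowns per coordinate, at least six well-spread calibration samples are required; the roving-dot sequence provides many more, and the surplus is absorbed by the least-squares fit.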

3.3.1 Initialization of Pupil/Limbus and Dot Tracking

Pupil center in the eye video stream and calibration dot in the scene video stream are tracked by different local search algorithms, both initialized by manually positioning a template over recognizable eye features and a crosshair over the calibration dot. Grip boxes allow for adjustment of the eye template (see Figure 4). During initialization, only one playback control is visible, controlling advancement of both video streams. It may be necessary to advance to the first frame with a clearly visible calibration dot. Subsequent searches exploit temporal coherence by using the previous search result as the starting location.

3.3.2 Dot Tracking

A simple greedy algorithm is used to track the calibration dot. The underlying assumption is that the dot is a set of bright pixels surrounded by darker pixels (see Figure 4). The sum of differences is largest at a bright pixel surrounded by dark pixels. The dot moves from one location to the next in discrete steps determined by the refresh rate of the display. To the human eye this appears as smooth motion, but in a single frame of video it appears as a short trail of multiple dots. To mitigate this effect the image is blurred with a Gaussian smoothing function, increasing the algorithm's tolerance to variations in dot size. In the present application the dot radius was roughly 3 to 5 pixels in the scene image frame.

The dot tracking algorithm begins with an assumed dot location obtained from the previous frame of video, or from initialization. A sum of differences is evaluated over an 8×8 reference window:

Σ_i Σ_j [ I(x, y) − I(x − i, y − j) ],    −8 < i, j < 8.    (2)

This evaluation is repeated over a 5×5 search field centered at the assumed location (x, y). If the assumed location yields a maximum within the 25-pixel field then the algorithm stops. Otherwise the location with the highest sum of differences becomes the new assumed location and the computation is repeated.

One drawback of this approach is that the dot is not well tracked near the edge of the laptop display. Reducing the search field and reference window allows better discrimination between the dot and display edges while reducing the tolerance to rapid dot movement.
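As an illustration, the sketch below implements the greedy hill-climbing search described above on a grayscale scene frame held in a NumPy array; the Gaussian blur, 8×8 reference window, and 5×5 search field follow the text, while border handling and the blur width are assumptions of this sketch.

```python
# Illustrative greedy dot tracker (cf. Equation (2)): hill-climb on the
# sum-of-differences score from the previously assumed dot location.
import numpy as np
from scipy.ndimage import gaussian_filter

def dot_score(img, x, y, r=8):
    """Sum of differences between I(x, y) and its surrounding window."""
    h, w = img.shape
    cols = np.arange(max(0, x - r + 1), min(w, x + r))
    rows = np.arange(max(0, y - r + 1), min(h, y + r))
    patch = img[np.ix_(rows, cols)]
    return float(np.sum(img[y, x] - patch))

def track_dot(frame, x, y, sigma=1.5):
    """Greedy search over a 5x5 field starting from the previous location."""
    img = gaussian_filter(frame.astype(float), sigma)  # blur the dot trail
    h, w = img.shape
    while True:
        best = (dot_score(img, x, y), x, y)
        for dx in range(-2, 3):
            for dy in range(-2, 3):
                cx, cy = x + dx, y + dy
                if 0 <= cx < w and 0 <= cy < h:
                    best = max(best, (dot_score(img, cx, cy), cx, cy))
        if (best[1], best[2]) == (x, y):   # assumed location is the maximum
            return x, y
        x, y = best[1], best[2]            # move to the better location, repeat
```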

3.4 Pupil/Limbus Tracking

A two-step process is used to locate the limbus (iris-sclera boundary) and hence pupil center in an eye image.


(a) constrained ray origin and extent (b) constrained rays and fit ellipse

Figure 5: Constrained search for limbic feature points: (a) constrained ray origin and termination point; (b) resultant rays, fitted ellipse, and center. For clarity of presentation only 36 rays are displayed in (b); in practice 360 feature points are identified.

First, feature points are detected. Second, an ellipse is fit to the feature points. The ellipse center is a good estimate of the pupil center.

3.4.1 Feature Detection

The purpose of feature detection is to identify point locations on the limbus. We use a technique similar to Starburst [Li et al. 2005]. A candidate feature point is found by casting a ray R away from an origin point O and terminating the ray as it exits a dark region. We determine if the ray is exiting a dark region by checking the gradient magnitude collinear with the ray. The location with maximum collinear gradient component max∇ is recorded as a feature point. Starburst used a fixed threshold value rather than the maximum and did not constrain the length of the rays.

Consistent and accurate feature point identification and selection is critical for stable and accurate eye tracking. Erroneous feature points are often located at the edges of the pupil, eyelid, or at a specular reflection. To mitigate these effects the feature point search area is constrained by further exploiting temporal coherence. The limbic boundary is not expected to move much from one frame to the next, therefore it is assumed that feature points will be near the ellipse E identified in the previous frame. If P is the intersection of ray R and ellipse E, the search is constrained according to:

max ∇( O + α(P − O) : 0.8 < α < 1.2 ),    (3)

as depicted in Figure 5. For the first frame in the video we use the eye model manually aligned at initialization to determine P.
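A sketch of the constrained search of Equation (3) follows, assuming a grayscale eye image, the ray origin O, and one intersection point P with the previous frame's ellipse per ray direction; the sampling density and the gradient estimate are choices made here, not details taken from the paper.

```python
# Sketch of the constrained limbic feature search (Eq. 3): for each ray,
# sample the image gradient along O + alpha*(P - O) for alpha in (0.8, 1.2)
# and keep the location of the maximum collinear gradient component.
import numpy as np

def limbus_features(gray, origin, ellipse_points, n_samples=25):
    """gray: 2-D image; origin: O as (x, y); ellipse_points: per-ray
    intersections P with the previous frame's ellipse (e.g., 360 of them)."""
    gy, gx = np.gradient(gray.astype(float))       # image gradients
    O = np.asarray(origin, dtype=float)
    features = []
    for P in ellipse_points:
        d = np.asarray(P, dtype=float) - O
        unit = d / (np.linalg.norm(d) + 1e-9)      # ray direction
        best, best_pt = -np.inf, None
        for alpha in np.linspace(0.8, 1.2, n_samples):
            x, y = O + alpha * d
            xi, yi = int(round(x)), int(round(y))
            if not (0 <= xi < gray.shape[1] and 0 <= yi < gray.shape[0]):
                continue
            collinear = gx[yi, xi] * unit[0] + gy[yi, xi] * unit[1]
            if collinear > best:
                best, best_pt = collinear, (x, y)
        if best_pt is not None:
            features.append(best_pt)
    return np.array(features)
```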

3.4.2 Ellipse Fitting and Evaluation

Ellipses are fit to the set of feature points using linear least squares minimization (e.g., [Lancaster and Šalkauskas 1986]). This method will generate ellipses even during blinks, when no valid ellipse is attainable. In order to detect these invalid ellipses we implemented an ellipse evaluation method.

Each pixel that the ellipse passes through is labeled as acceptable or not depending upon the magnitude and direction of the gradient at that pixel. The percentage of acceptable pixels is computed and included in the output as a confidence measure.
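One common way to realize such a fit is the linear least-squares conic fit sketched below (constant term fixed to 1), with the ellipse center recovered by setting the conic's gradient to zero. The paper cites Lancaster and Šalkauskas [1986] without spelling out the exact formulation, so this is an assumption-laden sketch rather than the authors' implementation.

```python
# Sketch of a linear least-squares conic/ellipse fit to the feature points.
import numpy as np

def fit_ellipse(points):
    """Fit a x^2 + b xy + c y^2 + d x + e y + 1 = 0 to (N, 2) points."""
    x, y = points[:, 0], points[:, 1]
    D = np.column_stack([x**2, x * y, y**2, x, y])
    coeffs, *_ = np.linalg.lstsq(D, -np.ones_like(x), rcond=None)
    a, b, c, d, e = coeffs
    # Center: where the conic's gradient vanishes.
    cx, cy = np.linalg.solve([[2 * a, b], [b, 2 * c]], [-d, -e])
    return (a, b, c, d, e), (cx, cy)
```

The confidence measure described above can then be obtained by sampling pixels along the fitted boundary and counting how many have gradient magnitude and direction consistent with the conic.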

3.4.3 Recovery From Failure

The ellipse fitting algorithm occasionally fails to identify a valid ellipse due to blinks or other occlusions. Reliance on temporal coherence can prevent the algorithm from recovering from such situations. To mitigate this problem we incorporated both manual and automatic recovery strategies. Automatic recovery relies on ellipse evaluation: if an ellipse evaluates poorly, it is not used to constrain the search for feature points in the subsequent frame. Instead, we revert to using the radius of the eye model as determined at initialization, in conjunction with the center of the last good ellipse. Sometimes this automatic recovery is insufficient to provide a good fit, however. Manual recovery is provided by displaying each fitted ellipse on the screen.

Figure 6: Display of fitted ellipse and computed gaze point.

If the user observes drift in the computed ellipse, the center may be nudged to the correct location using a simple drag and drop action.

These strategies are analogous to traditional keyframing operations, e.g., when match-moving. If a feature tracker fails to track a given pixel pattern, manual intervention is required at specific frames. The result is a semi-automatic combination of manual trackbox positioning and automatic trackbox translation. Although not as fast as a fully automatic approach, this is still considerably better than the fully manual, frame-by-frame alternative. A screenshot of the user interface is shown in Figure 6.

3.4.4 Tracking Accuracy

The DejaView camera has approximately a 60° field of view, with video resolution of 320×240. Therefore a simple multiplication by 0.1875 converts our measurement in pixels of Euclidean distance between gaze point and calibration coordinates to degrees of visual angle. Using this metric, the eye tracker's horizontal accuracy is better than 2°, on average [Ryan et al. 2008]. Vertical and horizontal accuracy are roughly equivalent.

3.5 Fixation Detection

After mapping eye coordinates to scene coordinates via Equation (1), the collected gaze points and timestamps x = (x, y, t) are analyzed to detect fixations in the data stream. Prior to this type of analysis, raw eye movement data is not very useful, as it represents a conjugate eye movement signal composed of a rapidly changing component (generated by fast saccadic eye movements) and the comparatively stationary component representative of fixations, the eye movements generally associated with cognitive processing.

There are two leading methods for detecting fixations in the raw eye movement data stream: the position-variance or velocity-based approaches. The former defines fixations spatially, with centroid and variance indicating spatial distribution [Anliker 1976]. If the variance of a given point is above some threshold, then that point is considered outside of any fixation cluster and is considered to be part of a saccade. The latter approach, which could be considered a dual of the former, examines the velocity of a gaze point, e.g., via differential filtering,

ẋ_i = (1/Δt) Σ_{j=0}^{k} x_{i+j} g_j,    i ∈ [0, n − k),

where k is the filter length and Δt = t_{i+k} − t_i. A 2-tap filter with coefficients g_j = {1, −1}, while noisy, can produce acceptable results. The point x_i is considered to be a saccade if the velocity ẋ_i is above threshold [Duchowski et al. 2002]. It is possible to combine these methods by either checking the two threshold detector outputs (e.g., for agreement) or by deriving the state-probability estimates, e.g., via Hidden Markov Models [Salvucci and Goldberg 2000].
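As a concrete illustration of the velocity-based test, the sketch below applies the 2-tap filter g = {1, −1} to timestamped gaze samples; the numeric threshold is a placeholder, not a value used in the paper.

```python
# Sketch of velocity-based saccade detection with a 2-tap filter over
# timestamped gaze samples (x, y, t). The threshold is a placeholder.
import numpy as np

def saccade_mask(samples, velocity_threshold=130.0):
    """samples: (N, 3) array of (x, y, t); True where the velocity test fires."""
    xy, t = samples[:, :2], samples[:, 2]
    dxy = np.diff(xy, axis=0)            # 2-tap filter: x[i+1] - x[i]
    dt = np.diff(t) + 1e-9               # guard against zero time steps
    speed = np.hypot(dxy[:, 0], dxy[:, 1]) / dt
    return speed > velocity_threshold    # length N-1, aligned with sample i
```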

In the present implementation, fixations are identified by a variant of the position-variance approach, with a spatial deviation threshold of 19 pixels and the minimum number of samples set to 10 (the fixation analysis code is freely available on the web¹).

¹The position-variance fixation analysis code was originally made available by LC Technologies. The original fixfunc.c can still be found on Andrew R. Freed's eye tracking web page: <http://freedville.com/professional/thesis/eyetrack-readme.html>. The C++ interface and implementation ported from C by Mike Ashmore are available at: <http://andrewd.ces.clemson.edu/courses/cpsc412/fall08>.


Figure 7: AOI trackbox with corners labeled (A,B,C,D).

Note that this approach is independent of frame rate, so long as each gaze point is listed with its timestamp, unlike a previous approach where fixation detection was tied to the video frame rate [Munn et al. 2008].
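The variant used here derives from the freely available LC Technologies code cited in the footnote; the following is only a simplified, dispersion-style sketch of the same idea, using the 19-pixel deviation threshold and 10-sample minimum quoted above and relying on timestamps rather than frame counts.

```python
# Simplified position-variance fixation grouping over timestamped gaze
# samples. A sketch, not the LC Technologies fixfunc.c used by the authors.
import numpy as np

def detect_fixations(samples, deviation=19.0, min_samples=10):
    """samples: (N, 3) array of (x, y, t). Returns (cx, cy, t_start, t_end)."""
    fixations, window = [], []
    for x, y, t in samples:
        window.append((x, y, t))
        pts = np.array(window)[:, :2]
        if np.max(np.linalg.norm(pts - pts.mean(axis=0), axis=1)) > deviation:
            # The new sample breaks the cluster: emit the old one if long enough.
            if len(window) - 1 >= min_samples:
                prev = np.array(window[:-1])
                cx, cy = prev[:, :2].mean(axis=0)
                fixations.append((cx, cy, prev[0, 2], prev[-1, 2]))
            window = [(x, y, t)]           # start a new candidate cluster
    if len(window) >= min_samples:         # flush the final cluster
        arr = np.array(window)
        cx, cy = arr[:, :2].mean(axis=0)
        fixations.append((cx, cy, arr[0, 2], arr[-1, 2]))
    return fixations
```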

The sequence of detected fixations can be processed to gain insight into the attentional deployment strategy employed by the wearer of the eye tracking apparatus. A common approach is to count the number of fixations observed over given Areas Of Interest, or AOIs, in the scene. To do so in dynamic media, i.e., over video, it is necessary to track the AOIs as their apparent position in the video translates due to camera movement.

3.6 Feature Tracking

By tracking the movement of individual features it is possible to approximate the movement of identified AOIs. We allow the user to place trackboxes at any desired feature in the scene. The trackbox then follows the feature as it translates from frame to frame. This is similar in principle to the match-moving tracker window in common compositing software packages (e.g., Apple's Shake [Paolini 2006]). Locations of trackboxes are written to the output data file along with corresponding gaze coordinates. We then post-process the data to compute fixation and AOI information from gaze point and trackbox data.

The user places a trackbox by clicking on the trackbox symbol, dragging and dropping it onto the desired feature. A user may place as many trackboxes as desired. For our study, trackboxes were placed at the corners of each monitor.

Feature tracking is similar to that used for tracking the calibration dot, with some minor adaptations. Computation is reduced by precomputing a summed area table S [Crow 1984]. The value of any pixel in S stores the sum of all pixels above and to the left of the corresponding pixel in the original image,

S(x, y) = Σ_i Σ_j I(i, j),    0 < i < x, 0 < j < y.    (4)

Computation of the summation table is efficiently performed by a dynamic programming approach (see Algorithm 1).

for (y = 0 to h)
    sum = 0
    for (x = 0 to w)
        sum = sum + I(x, y)
        S(x, y) = sum + S(x, y − 1)

Algorithm 1: Single-pass computation of summation table.

Figure 8: Trackboxes t1, t2, t3, AOIs A, B, ..., I, and fixation x.

The summation table is then used to efficiently compute the average pixel value within the reference window of the trackbox (see Figure 7). As in dot tracking, a 5×5 search field is used within an 8×8 reference window. Equation (2) is now replaced with I(x, y) − µ, where

µ = [ (S(A) + S(B)) − (S(C) + S(D)) ] / (p × q).

Trackable features include both bright spots and dark spots in the scene image. For a bright spot, I(x, y) − µ is maximum at the target location. Dark spots produce minima at target locations. Initial placement of the trackbox determines whether the feature to be tracked is a bright or dark spot, based on the sign of the initial evaluation of I(x, y) − µ.
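Putting Algorithm 1 and the window mean together, a minimal sketch of the summed-area table and the resulting trackbox score is shown below; the corner arithmetic uses the standard integral-image formula, which the corner labels A, B, C, D of Figure 7 are assumed to correspond to.

```python
# Sketch of the summed-area table (Algorithm 1 / Eq. 4) and the window
# mean used for trackbox matching. Indexing conventions are assumptions.
import numpy as np

def summed_area_table(img):
    """S(x, y) = sum of all pixels above and to the left of (x, y)."""
    return img.astype(float).cumsum(axis=0).cumsum(axis=1)

def window_mean(S, x0, y0, x1, y1):
    """Mean of img[y0:y1+1, x0:x1+1] from four table lookups."""
    total = S[y1, x1]
    if x0 > 0:
        total -= S[y1, x0 - 1]
    if y0 > 0:
        total -= S[y0 - 1, x1]
    if x0 > 0 and y0 > 0:
        total += S[y0 - 1, x0 - 1]
    return total / ((y1 - y0 + 1) * (x1 - x0 + 1))

# A trackbox score at (x, y) is then I(x, y) - window_mean(...), maximal
# for bright features and minimal for dark ones, as described above.
```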

Some features cannot be correctly tracked because they exit the camera field. For this study, three trackboxes were sufficient to properly track all areas of interest within the scene viewed by participants in the study. Extra trackboxes were placed and the three that appeared to be providing the best track were selected manually. Our implementation output a text file and a video. The text file contained one line per frame of video. Each line included a frame number, the (x, y) coordinates of each trackbox, the (x, y) coordinates of the corresponding gaze point, and a confidence number. See Figure 6 for a sample frame of the output video. Note the frame number in the upper left corner.

The video was visually inspected to determine frame numbers for the beginning and end of stimulus presentation, and the most usable trackboxes. Text files were then manually edited to remove extraneous information.

3.7 AOI Labeling

The most recent approach to AOI tracking used structure from motion to compute 3D information from eye gaze data [Munn and Pelz 2008]. We found such complex computation unnecessary because we did not need 3D information; we only wanted analysis of fixations in AOIs. While structure from motion is able to extract 3D information including head movement, it assumes a static scene. Our method makes no such assumption: AOIs may move independently from the observer, and independently from each other. Structure from motion can, however, handle some degree of occlusion that our approach does not. Trackboxes are unable to locate any feature that becomes obstructed from view.

AOI labeling begins with the text files containing gaze data and trackbox locations as described above. The text files were then automatically parsed and fed into our fixation detection algorithm. Using the location of the trackboxes at the end of a fixation, we were able to assign AOI labels to each fixation. For each video a short program was written to apply translation, rotation, and scaling before labeling the fixations, with selected trackboxes defining the local frame of reference.


Figure 9: Labeling AOIs. Trackboxes, usually at image corners, are used to maintain position and orientation of the 9-window display panel; each of the AOIs is labeled in sequential alphanumeric order from top-left to bottom-right—the letter 'O' is used to record when a fixation falls outside of the display panels. In this screenshot, the viewer is looking at the purple flower field.

The programs varied slightly depending upon which trackboxes were chosen. For example, consider a fixation detected at location x, with trackboxes t1, t2, t3, and AOIs A, B, ..., I as illustrated in Figure 8. Treating t1 as the origin of the reference frame, trackboxes t2 and t3 as well as the fixation x are translated to the origin by subtracting the coordinates of trackbox t1. Following translation, the coordinates of trackbox t2 define the rotation angle, θ = tan⁻¹(t2y/t2x). A standard rotation matrix is used to rotate fixation point x to bring it into alignment with the horizontal x-axis. Finally, if trackbox t3 is located two-thirds across and down the panel display, then the fixation coordinates are scaled by 2/3. The now axis-aligned and scaled fixation point x is checked for which third of the axis-aligned box it is positioned in and the appropriate label is assigned. Note that this method of AOI tracking is scale- and 2D-rotationally-invariant. It is not, however, invariant to shear resulting from feature rotation in 3D (e.g., perspective rotation).
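A sketch of this translate/rotate/scale labeling, following the example of Figure 8, is given below; for simplicity it assumes trackbox t3 sits at the far corner of the display (rather than two-thirds across, as in the paper's example), so the scaling convention is an assumption of the sketch.

```python
# Sketch of AOI labeling: translate by t1, rotate so t2 lies on the x-axis,
# scale so t3 maps to (1, 1), then bin the fixation into the 3x3 grid A..I
# (or 'O' if outside). Conventions are illustrative.
import numpy as np

def label_fixation(fix, t1, t2, t3):
    fix, t1, t2, t3 = (np.asarray(p, dtype=float) for p in (fix, t1, t2, t3))
    # Translate so t1 is the origin of the local frame.
    p2, p3, px = t2 - t1, t3 - t1, fix - t1
    # Rotate so t2 aligns with the horizontal axis.
    theta = np.arctan2(p2[1], p2[0])
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, s], [-s, c]])
    p3, px = R @ p3, R @ px
    # Scale so the panel spans [0, 1] x [0, 1] (t3 assumed at the far corner).
    u, v = px[0] / p3[0], px[1] / p3[1]
    if not (0.0 <= u <= 1.0 and 0.0 <= v <= 1.0):
        return 'O'                                   # outside the display
    col, row = min(int(u * 3), 2), min(int(v * 3), 2)
    return 'ABCDEFGHI'[row * 3 + col]
```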

Following fixation localization, another text file is then output with one line per fixation. Each line contains the subject number, stimulus identifier, AOI label, and fixation duration. This information is then reformatted for subsequent statistical analysis by the statistical package used (R in this case).

4 Applied Example

In an experiment conducted to better understand the potential health benefits of images of nature in a hospital setting, participants' gaze was recorded, along with physiological and self-reported psychological data.

Eye Movement Analysis. For analysis of fixations within AOIs, trackboxes were placed at the corners of the 3×3 panel display in the scene video. All 9 AOIs were assumed to be equally sized, connected rectangles (see Figure 9). Trackboxes were used to determine AOI position, orientation, and scale. Out-of-plane rotation was not considered. Trackboxes on the outside corners of the 3×3 grid were preferred; otherwise linear interpolation was used to determine exterior boundaries of the grid.

Stimulus. Using the prospect-refuge theory of landscape preference [Appleton 1996], four different categories of images (see Figure 10) were viewed by participants before and after undergoing a pain stressor (hand in ice water for up to 120 seconds). A fifth group of participants (control) viewed the same display wall (see below) with the monitors turned off.

Apparatus, Environment, & Data Collected. Participants viewed each image on a display wall consisting of nine video monitors arranged in a 3×3 grid. Each of the nine video monitors' display areas measured 36″ wide × 21″ high, with each monitor framed by a 1/2″ black frame, for an overall measurement of 9′ wide × 5′3″ high.

The mock patient room measured approximately 15.6′ × 18.6′. Participants viewed the display wall from a hospital bed facing the monitors. The bed was located approximately 5′3″ from the display wall, with its footboard measuring 3.6′ high off the floor (the monitors were mounted 3′ from the floor). As each participant lay on the bed, their head location measured approximately 9.6′ to the center of the monitors. Given these dimensions and distances, and using θ = 2 tan⁻¹(r/(2D)) to represent visual angle, with r = 9′ and D = 9.6′, the monitors subtended θ = 50.2° visual angle.
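As a quick check of that arithmetic (an illustrative snippet, not part of the study's tooling):

```python
# theta = 2 * atan(r / (2D)) with r = 9 ft and D = 9.6 ft
import math
theta = 2 * math.degrees(math.atan(9.0 / (2 * 9.6)))
print(round(theta, 1))  # 50.2 degrees, matching the value quoted above
```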

Pain perception, mood, blood pressure, and heart rate were continually assessed during the experiment. Results from these measurements are omitted here; they are mentioned only to give the reader a sense of the complete procedure employed in the experiment.

Procedure. Each participant was greeted and asked to provide documentation of informed consent. After situating themselves on the bed facing the display wall, each participant involved in the eye tracking portion of the study donned the wearable eye tracker. A laptop was then placed in front of them on a small rolling table and the participant was asked to view the calibration dot sequence. Following calibration, each participant viewed the image stimulus (or blank monitors) for two minutes as timed by a stopwatch.

Subjects. 109 healthy college students took part in the study, with a small subsample (21) participating in the eye tracking portion.

Experimental Design. The study used a mixed randomized design. Analysis of recorded gaze points by participants wearing the eye tracker was performed based on a repeated measures design where the set of fixations generated by each individual was treated as the within-subjects fixed factor.

Discarded Data. Four recordings were collected over each of four stimulus images, with four additional recordings displaying no image as control. There was one failed attempt to record data over the purple flower field stimulus; a replacement recording was made. There were 21 sessions in all.

Ten recordings were discarded during post-processing because video quality prohibited effective eye tracking. In each of these videos some combination of multiple factors rendered them unusable. These factors included heavy mascara, eyelid occlusion, frequent blinking, low contrast between iris and sclera, poor positioning of eye cameras, and calibration dots not in the field of view. We successfully processed 2 control, 4 yellow field, 1 tree, 2 fire, and 2 purple flower field videos.

Poor camera positioning could have been discovered and corrected if the cameras provided real-time video feedback. Our hardware did not support online processing. Online processing could have provided additional feedback allowing for detection and mitigation of most other video quality issues.

5 Results

Using AOIs and image type as fixed factors (with participant as the random factor [Baron and Li 2007]), repeated-measures two-way ANOVA indicates a marginally significant main effect of AOI on fixation duration (F(9,1069) = 2.08, p < 0.05; see Figure 11).²

²Assuming sphericity as computed by R.


Figure 10: Stimulus images: (a) yellow field: prospect (Getty Images), (b) tree: refuge (Getty Images), (c) fire: hazard (Getty Images), (d) purple flower field: mixed prospect and refuge (courtesy Ellen A. Vincent).

Figure 11: Comparison of mean fixation duration per AOI averaged over image types, with standard error bars.

Averaging over image types, pair-wise t-tests with pooled SD indicate no significant differences in fixation durations between any pair of AOIs.

Repeated-measures ANOVA also indicates a significant main effect of image type on fixation duration (F(34,1044) = 1.78, p < 0.01), with the AOI × image interaction not significant (see Figure 12). Averaging over AOIs, pair-wise t-tests with pooled SD indicate significantly different fixation durations between the control image (blank screen) and the tree image (p < 0.01, with Bonferroni correction). No other significant differences were detected.

6 Discussion

Averaging over image types, the marginally significant difference in fixation durations over AOIs suggests that longest durations tend to fall on central AOIs (E and H). This simply suggests that viewers tend to fixate the image center. This is not unusual, particularly in the absence of a specific viewing task [Wooding 2002]. Post-hoc pair-wise comparisons failed to reveal significant differences, which is likely due to the relatively high variability of the data.

Averaging over AOIs shows that the tree image drew significantly shorter fixations than the control (blank) screen. Due to averaging, however, it is difficult to infer further details regarding fixation duration distributions over particular image regions. Cursory examination of Figure 12 suggests shorter fixations over the center panels (E & H), compared to the longer dwell times made when the screen was blank. Considering the averaging inherent in ANOVA, this could just mean that fixations are more evenly distributed over the tree image than over the blank display, where it is fairly clear that viewers mainly looked at the center panels. This may suggest a greater amount of visual interest offered by the tree image and a propensity of viewers to look around more when presented with stimulus than when there is nothing of interest at all.

Figure 12: Comparison of mean fixation duration per AOI and per image type (control, yellow field, tree, fire, lavender field), with standard error bars.

A similar observation could be made regarding the fixation durations found over region C (upper right) for the purple flower field image, an image with which viewers perceived lower sensory pain compared to those who viewed other landscape images and no images, with statistical significance at α = 0.1 [Vincent et al. 2009]. However, the difference in fixation durations over region C is not significant according to the pair-wise post-hoc analysis.

7 Conclusion

A match-moving approach was presented to help automate analysis of eye movements collected by a wearable eye tracker. Technical contributions addressed video stream synchronization, pupil detection, eye movement analysis, and tracking of dynamic scene Areas Of Interest (AOIs). The techniques were demonstrated in the evaluation of eye movements on images of nature viewed by subjects participating in an experiment on the perception of well-being. Although descriptive statistics of gaze locations over AOIs failed to show significance of any particular AOI except the center, the methodology is applicable toward similar future studies.

References

ANLIKER, J. 1976. Eye Movements: On-Line Measurement, Analysis, and Control. In Eye Movements and Psychological Processes, R. A. Monty and J. W. Senders, Eds. Lawrence Erlbaum Associates, Hillsdale, NJ, 185–202.

APPLETON, J. 1996. The Experience of Landscape. John Wiley & Sons, Ltd., Chichester, UK.


BABCOCK, J. S. AND PELZ, J. B. 2004. Building a Lightweight Eyetracking Headgear. In ETRA '04: Proceedings of the 2004 Symposium on Eye Tracking Research & Applications. ACM, San Antonio, TX, 109–114.

BALLARD, D. H., HAYHOE, M. M., AND PELZ, J. B. 1995. Memory Representations in Natural Tasks. Journal of Cognitive Neuroscience 7, 1, 66–80.

BARON, J. AND LI, Y. 2007. Notes on the use of R for psychology experiments and questionnaires. Online Notes. URL: <http://www.psych.upenn.edu/∼baron/rpsych/rpsych.html> (last accessed December 2007).

BUSWELL, G. T. 1935. How People Look At Pictures. University of Chicago Press, Chicago, IL.

CROW, F. C. 1984. Summed-area tables for texture mapping. In SIGGRAPH '84: Proceedings of the 11th Annual Conference on Computer Graphics and Interactive Techniques. ACM, New York, NY, 207–212.

DUCHOWSKI, A., MEDLIN, E., COURNIA, N., GRAMOPADHYE, A., NAIR, S., VORAH, J., AND MELLOY, B. 2002. 3D Eye Movement Analysis. Behavior Research Methods, Instruments, Computers (BRMIC) 34, 4 (November), 573–591.

FREED, A. R. 2003. The Effects of Interface Design on Telephone Dialing Performance. M.S. thesis, Pennsylvania State University, University Park, PA.

JACOB, R. J. K. AND KARN, K. S. 2003. Eye Tracking in Human-Computer Interaction and Usability Research: Ready to Deliver the Promises. In The Mind's Eye: Cognitive and Applied Aspects of Eye Movement Research, J. Hyönä, R. Radach, and H. Deubel, Eds. Elsevier Science, Amsterdam, The Netherlands, 573–605.

LANCASTER, P. AND ŠALKAUSKAS, K. 1986. Curve and Surface Fitting: An Introduction. Academic Press, San Diego, CA.

LAND, M., MENNIE, N., AND RUSTED, J. 1999. The Roles of Vision and Eye Movements in the Control of Activities of Daily Living. Perception 28, 11, 1307–1432.

LAND, M. F. AND HAYHOE, M. 2001. In What Ways Do Eye Movements Contribute to Everyday Activities. Vision Research 41, 25-26, 3559–3565. (Special Issue on Eye Movements and Vision in the Natural World, with most contributions to the volume originally presented at the 'Eye Movements and Vision in the Natural World' symposium held at the Royal Netherlands Academy of Sciences, Amsterdam, September 2000).

LI, D. 2006. Low-Cost Eye-Tracking for Human Computer Interaction. M.S. thesis, Iowa State University, Ames, IA. Techreport TAMU-88-010.

LI, D., BABCOCK, J., AND PARKHURST, D. J. 2006. openEyes: A Low-Cost Head-Mounted Eye-Tracking Solution. In ETRA '06: Proceedings of the 2006 Symposium on Eye Tracking Research & Applications. ACM, San Diego, CA.

LI, D. AND PARKHURST, D. 2006. Open-Source Software for Real-Time Visible-Spectrum Eye Tracking. In Conference on Communication by Gaze Interaction. COGAIN, Turin, Italy.

LI, D., WINFIELD, D., AND PARKHURST, D. J. 2005. Starburst: A hybrid algorithm for video-based eye tracking combining feature-based and model-based approaches. In Vision for Human-Computer Interaction Workshop (in conjunction with CVPR).

MEGAW, E. D. AND RICHARDSON, J. 1979. Eye Movements and Industrial Inspection. Applied Ergonomics 10, 145–154.

MORIMOTO, C. H. AND MIMICA, M. R. M. 2005. Eye Gaze Tracking Techniques for Interactive Applications. Computer Vision and Image Understanding 98, 4–24.

MUNN, S. M. AND PELZ, J. B. 2008. 3D point-of-regard, position and head orientation from a portable monocular video-based eye tracker. In ETRA '08: Proceedings of the 2008 Symposium on Eye Tracking Research & Applications. ACM, Savannah, GA, 181–188.

MUNN, S. M., STEFANO, L., AND PELZ, J. B. 2008. Fixation-identification in dynamic scenes: Comparing an automated algorithm to manual coding. In APGV '08: Proceedings of the 5th Symposium on Applied Perception in Graphics and Visualization. ACM, New York, NY, 33–42.

PAOLINI, M. 2006. Apple Pro Training Series: Shake 4. Peachpit Press, Berkeley, CA.

PELZ, J. B., CANOSA, R., AND BABCOCK, J. 2000. Extended Tasks Elicit Complex Eye Movement Patterns. In ETRA '00: Proceedings of the 2000 Symposium on Eye Tracking Research & Applications. ACM, Palm Beach Gardens, FL, 37–43.

REICH, S., GOLDBERG, L., AND HUDEK, S. 2004. Deja View Camwear Model 100. In CARPE '04: Proceedings of the 1st ACM Workshop on Continuous Archival and Retrieval of Personal Experiences. ACM Press, New York, NY, 110–111.

RYAN, W. J., DUCHOWSKI, A. T., AND BIRCHFIELD, S. T. 2008. Limbus/pupil switching for wearable eye tracking under variable lighting conditions. In ETRA '08: Proceedings of the 2008 Symposium on Eye Tracking Research & Applications. ACM, New York, NY, 61–64.

SALVUCCI, D. D. AND GOLDBERG, J. H. 2000. Identifying Fixations and Saccades in Eye-Tracking Protocols. In ETRA '00: Proceedings of the 2000 Symposium on Eye Tracking Research & Applications. ACM, Palm Beach Gardens, FL, 71–78.

SMEETS, J. B. J., HAYHOE, M. M., AND BALLARD, D. H. 1996. Goal-Directed Arm Movements Change Eye-Head Coordination. Experimental Brain Research 109, 434–440.

VINCENT, E., BATTISTO, D., GRIMES, L., AND MCCUBBIN, J. 2009. Effects of nature images on pain in a simulated hospital patient room. Health Environments Research and Design. In press.

WEBB, N. AND RENSHAW, T. 2008. Eyetracking in HCI. In Research Methods for Human-Computer Interaction, P. Cairns and A. L. Cox, Eds. Cambridge University Press, Cambridge, UK, 35–69.

WOODING, D. 2002. Fixation Maps: Quantifying Eye-Movement Traces. In Proceedings of ETRA '02. ACM, New Orleans, LA.
