Identifying Objects in Images from Analyzing the User's Gaze Movements for Provided Tags

Tina Walber, Ansgar Scherp, Steffen Staab
University of Koblenz-Landau, Koblenz, Germany
Multimedia Modeling Conference (MMM 2012), Klagenfurt, Austria

Description

Slides of our MMM 2012 paper. Download the slides to enjoy all animations.

Transcript of Identifying Objects in Images from Analyzing the User's Gaze Movements for Provided Tags

Page 1: Identifying Objects in Images from Analyzing the User's Gaze Movements for Provided Tags

Identifying Objects in Images from Analyzing the User's Gaze Movements for Provided Tags

Tina Walber, Ansgar Scherp, Steffen Staab
University of Koblenz-Landau, Koblenz, Germany

Multimedia Modeling Conference
Klagenfurt, Austria, January 4-6, 2012

Page 2: Identifying Objects in Images from Analyzing the User's Gaze Movements for Provided Tags

Motivation: Image Tagging

Find specific objects in images
Analyzing the user's gaze path only

[Example image tagged with: sidewalk, car, store, tree, people, girl]

Page 3: Identifying Objects in Images from Analyzing the User's Gaze Movements for Provided Tags

Research Questions

1. Best fixation measure to find the correct image region given a specific tag?

2. Can we differentiate two regions in the same image?

Page 4: Identifying Objects in Images from Analyzing the User's Gaze Movements for Provided Tags

3 Steps Conducted by Users

1. Look at a red blinking dot
2. View the image together with the provided tag
3. Decide whether the tagged object can be seen ("y" or "n")

Page 5: Identifying Objects in Images from Analyzing the User's Gaze Movements for Provided Tags

Dataset: LabelMe community images

Manually drawn polygons
Regions annotated with tags

182,657 images (August 2010)

High-quality segmentation and annotation

Used as ground truth
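
The LabelMe annotations are plain XML files; below is a minimal reading sketch, assuming the usual LabelMe layout with <object>, <name>, and <polygon>/<pt> elements (the file name is only a placeholder, not from the slides).

```python
# Minimal sketch: read tagged polygon regions from a LabelMe XML annotation.
# Assumes the usual LabelMe layout (<object> with <name> and <polygon>/<pt>);
# "annotation.xml" is a placeholder file name.
import xml.etree.ElementTree as ET

def load_labelme_regions(path):
    """Return a list of (tag, polygon) pairs, polygon = [(x, y), ...]."""
    root = ET.parse(path).getroot()
    regions = []
    for obj in root.findall("object"):
        tag = (obj.findtext("name") or "").strip()
        pts = [(float(pt.findtext("x")), float(pt.findtext("y")))
               for pt in obj.findall("polygon/pt")]
        if tag and pts:
            regions.append((tag, pts))
    return regions

# Example: regions = load_labelme_regions("annotation.xml")
```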

Page 6: Identifying Objects in Images from Analyzing the User's Gaze Movements for Provided Tags

Experiment Images and Tags

Randomly selected 51 images
Contain at least two tagged regions

Created two tag sets for the 51 images
Each image is assigned two tags (one per set)

Keep subjects concentrated during the experiment

Tags are either "true" or "false"
"true": the object described by the tag can be seen
"false": the object cannot be seen in the image

Page 7: Identifying Objects in Images from Analyzing the User's Gaze Movements for Provided Tags

Subjects & Experiment System 20 subjects

16 male, 4 female (age: 23-40, Ø=29.6) Undergrads (6), PhD (12), office clerks (2)

Experiment system Simple web page in Internet Explorer Standard notebook, resolution 1680x1050 Tobii X60 eye-tracker (60 Hz, 0.5° accuracy)

Page 8: Identifying Objects in Images from Analyzing the User's Gaze Movements for Provided Tags

Conducting the Experiment

Each user looked at 51 tag-image pairs
The first tag-image pair was discarded

94.3% correct answers
Equal for true and false tags
~3 s until decision (on average)

85% of users strongly agreed or agreed that they felt comfortable during the experiment

The eye tracker had little influence on comfort

Page 9: Identifying Objects in Images from Analyzing the User's Gaze Movements for Provided Tags

Pre-processing of Eye-tracking Data

Obtained 547 gaze paths from 20 users where
the user gave the correct answer and
the image has a "true" tag assigned

Fixation: focus on a particular point on the screen
Fixation extraction with Tobii Studio's velocity & distance thresholds

Requirement: at least one fixation inside or near the correct region
476 (87%) of the gaze paths fulfill this requirement
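
Tobii Studio's exact fixation filter is proprietary; the sketch below is a generic velocity-threshold (I-VT style) detector in the same spirit. Only the 60 Hz sampling rate comes from the slides; the speed and duration thresholds are illustrative assumptions.

```python
# Generic velocity-threshold (I-VT style) fixation detection sketch, not
# Tobii Studio's proprietary filter. Only the 60 Hz rate is from the slides.
def detect_fixations(samples, hz=60, max_speed_px_s=1000.0, min_dur_s=0.1):
    """samples: list of (x, y) gaze points; returns (cx, cy, duration_s) fixations."""
    dt = 1.0 / hz
    fixations, current = [], []

    def flush():
        # Keep the group as a fixation only if it lasted long enough.
        if len(current) * dt >= min_dur_s:
            cx = sum(p[0] for p in current) / len(current)
            cy = sum(p[1] for p in current) / len(current)
            fixations.append((cx, cy, len(current) * dt))
        current.clear()

    for prev, cur in zip(samples, samples[1:]):
        speed = ((cur[0] - prev[0]) ** 2 + (cur[1] - prev[1]) ** 2) ** 0.5 / dt
        if speed <= max_speed_px_s:   # eye roughly still: extend the fixation
            current.append(cur)
        else:                         # saccade: close the current fixation
            flush()
    flush()
    return fixations
```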

Page 10: Identifying Objects in Images from Analyzing the User's Gaze Movements for Provided Tags

Analysis of Gaze Fixations (1)

Applied 13 fixation measures to the 476 gaze paths (2 new, 7 standard Tobii, 4 from the literature)

Fixation measure: a function on the users' gaze paths
Calculated for each image region, over all users viewing the same tag-image pair

Page 11: Identifying Objects in Images from Analyzing the User's Gaze Movements for Provided Tags

Considered Fixation Measures

Nr | Name | Description (favorite region r) | Origin
1 | firstFixation | No. of fixations before the 1st fixation on r | Tobii
2 | secondFixation | No. of fixations before the 2nd fixation on r | [13]
3 | fixationsAfter | No. of fixations after the last fixation on r | [4]
4 | fixationsBeforeDecision | Like fixationsAfter, but before the decision | New
5 | fixationsAfterDecision | Like fixationsAfter, but after the decision | New
6 | fixationDuration | Total duration of all fixations on r | Tobii
7 | firstFixationDuration | Duration of the first fixation on r | Tobii
8 | lastFixationDuration | Duration of the last fixation on r | [11]
9 | fixationCount | Number of fixations on r | Tobii
10 | maxVisitDuration | Max. time from first fixation until outside r | Tobii
11 | meanVisitDuration | Mean time from first fixation until outside r | Tobii
12 | visitCount | No. of fixations until outside r | Tobii
13 | saccLength | Saccade length before a fixation on r | [6]
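
To make two of the simpler rows above concrete, here is a sketch of fixationCount (row 9) and fixationDuration (row 6) for a single region r. Fixations are assumed to be (x, y, duration) triples and regions polygons of (x, y) points; the ray-casting point-in-polygon helper is standard textbook code, not code from the paper.

```python
# Sketch of two measures from the table: fixationCount and fixationDuration.
def point_in_polygon(x, y, poly):
    """Standard ray-casting point-in-polygon test; poly = [(x, y), ...]."""
    inside = False
    j = len(poly) - 1
    for i in range(len(poly)):
        xi, yi = poly[i]
        xj, yj = poly[j]
        if (yi > y) != (yj > y) and x < (xj - xi) * (y - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside

def fixation_count(fixations, region):
    """Number of fixations on region r (row 9)."""
    return sum(1 for (x, y, _) in fixations if point_in_polygon(x, y, region))

def fixation_duration(fixations, region):
    """Total duration of all fixations on region r (row 6)."""
    return sum(d for (x, y, d) in fixations if point_in_polygon(x, y, region))
```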

Page 12: Identifying Objects in Images from Analyzing the User's Gaze Movements for Provided Tags

Analysis of Gaze Fixations (2)

For every image region (b), the fixation measure is calculated over all gaze paths (c)

Results are summed up per region
Regions are ordered according to the fixation measure
If the favorite region (d) and the tag (a) match, the result is a true positive (tp), otherwise a false positive (fp)
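
This aggregation step can be summarized in a few lines. The sketch below assumes a measure given as a function of (gaze path, region) whose larger values indicate the favorite region (true for count- and duration-style measures; rank-style measures such as firstFixation would be ordered the other way). All names are illustrative, not from the paper.

```python
# Sketch of the aggregation step: sum a fixation measure over all gaze paths
# for every region, pick the favorite (highest-scoring) region, and count a
# true positive if it matches the tag's ground-truth region.
def favorite_region(gaze_paths, regions, measure_fn):
    scores = [0.0] * len(regions)
    for path in gaze_paths:                      # all users viewing this pair
        for i, region in enumerate(regions):
            scores[i] += measure_fn(path, region)
    return max(range(len(regions)), key=scores.__getitem__)

def evaluate(pairs, measure_fn):
    """pairs: list of (gaze_paths, regions, index_of_true_region)."""
    tp = fp = 0
    for gaze_paths, regions, true_idx in pairs:
        if favorite_region(gaze_paths, regions, measure_fn) == true_idx:
            tp += 1
        else:
            fp += 1
    return tp / (tp + fp)                        # precision P
```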

Page 13: Identifying Objects in Images from Analyzing the User's Gaze Movements for Provided Tags

Precision per Fixation Measure

[Chart: precision P and the sum of tp and fp assignments, plotted per fixation measure; labeled measures: meanVisitDuration, fixationsBeforeDecision, lastFixationDuration, fixationDuration]

Page 14: Identifying Objects in Images from Analyzing the User's Gaze Movements for Provided Tags

Adding Boundaries and Weights

Take eye-tracker inaccuracies into account
Extend region boundaries by 13 pixels

Larger regions are more likely to be fixated
Give additional weight to regions < 5% of the image size

meanVisitDuration increases to P = 0.67
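
A sketch of both adjustments follows. Only the 13-pixel margin and the 5% size threshold come from the slides; the boost factor and the use of the shapely library are assumptions for illustration.

```python
# Sketch of the two adjustments: extend region boundaries to absorb
# eye-tracker inaccuracy, and weight very small regions up.
from shapely.geometry import Polygon

def extended_region(polygon_pts, margin_px=13):
    """Grow the region polygon by the margin (13 px as on the slide)."""
    return Polygon(polygon_pts).buffer(margin_px)

def weighted_score(score, polygon_pts, image_area, small_frac=0.05, boost=2.0):
    """Boost the measure value of regions smaller than 5% of the image; the
    boost factor is illustrative."""
    if Polygon(polygon_pts).area / image_area < small_frac:
        return score * boost
    return score
```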

Page 15: Identifying Objects in Images from Analyzing the User's Gaze Movements for Provided Tags

Examples: Tag-Region Assignments

Page 16: Identifying Objects in Images from Analyzing the User's Gaze Movements for Provided Tags

Comparison with Baselines

Naïve baseline: the largest region r is the favorite
Random baseline: randomly select the favorite region r

Gaze / Gaze* significantly better (χ², α<0.001)
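
The two baselines are straightforward; a sketch, reusing the polygon representation from the earlier snippets (shapely again an assumed dependency for the area computation):

```python
# Sketch of the two baselines; regions are polygons [(x, y), ...].
import random
from shapely.geometry import Polygon

def naive_baseline(regions):
    """Favorite region = the largest region."""
    return max(range(len(regions)), key=lambda i: Polygon(regions[i]).area)

def random_baseline(regions):
    """Favorite region = a uniformly random region."""
    return random.randrange(len(regions))
```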

Page 17: Identifying Objects in Images from Analyzing the User's Gaze Movements for Provided Tags

Effect of Gaze Path Aggregation

[Chart: aggregation of precision P for Gaze* over the number of gaze paths used; annotated gains: +46% and +4%]

Single user still significantly better than the baselines (χ² test: naive with α < 0.001, random with α < 0.002)
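
The aggregation curve can be reproduced by evaluating the measure on random subsets of the available gaze paths; a sketch building on the evaluate() helper from the earlier snippet (all names are illustrative, not from the paper):

```python
# Sketch of the aggregation experiment: precision P when only k randomly
# chosen gaze paths per tag-image pair are used.
import random

def precision_vs_paths(pairs, measure_fn, max_paths, runs=20):
    curve = {}
    for k in range(1, max_paths + 1):
        total = 0.0
        for _ in range(runs):  # average over random subsets of gaze paths
            sub = [(random.sample(paths, min(k, len(paths))), regions, true_idx)
                   for paths, regions, true_idx in pairs]
            total += evaluate(sub, measure_fn)   # evaluate() from the sketch above
        curve[k] = total / runs
    return curve
```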

Page 18: Identifying Objects in Images from Analyzing the User's Gaze Movements for Provided Tags

Research Questions

1. Best fixation measure to find the correct image region given a specific tag?

2. Can we differentiate two regions in the same image?

meanVisitDuration with precision of 67%

Page 19: Identifying Objects in Images from Analyzing the User's Gaze Movements for Provided Tags

Differentiate Two Objects

Use the second tag set to identify different objects in the same image

16 images (of our 51) have two "true" tags
6 of these images had both regions correctly identified

Proportion of 38%

Average precision for a single object is 67%

Correct tag assignment for two objects: 44%

Page 20: Identifying Objects in Images from Analyzing the User's Gaze Movements for Provided Tags

Correctly Differentiated Objects

Page 21: Identifying Objects in Images from Analyzing the User's Gaze Movements for Provided Tags

Research Questions

1. Best fixation measure to find the correct image region given a specific tag?

2. Can we differentiate two regions in the same image?

meanVisitDuration with precision of 67%

Yes, with an accuracy of 38%

Acknowledgement: This research was partially supported by the EU projects Petamedia (FP7-216444) and SocialSensor (FP7-287975).

Page 22: Identifying Objects in Images from Analyzing the User's Gaze Movements for Provided Tags

Influence of Red Dot

First 5 fixations, over all subjects and all images

Page 23: Identifying Objects in Images from Analyzing the User's Gaze Movements for Provided Tags

Experiment Data Cleaning

Manually replaced images with:

a) Tags that are incomprehensible, require expert knowledge, or are nonsense

b) Tags that refer to multiple regions of which not all are drawn in the image (e.g., bicycle)

c) Occluded objects (e.g., a bicycle behind a car)

d) "False" tags that actually refer to a visible part of the image and thus were "true" tags