
  • Following Gaze in Video

Adrià Recasens, Carl Vondrick, Aditya Khosla, Antonio Torralba
Massachusetts Institute of Technology

    {recasens, vondrick, khosla, torralba}@csail.mit.edu


Figure 1: (a) What is Tom Hanks looking at? When we watch a movie, understanding what a character is paying attention to requires reasoning about multiple views. Often, the character is looking at something that falls outside the frame, as in (a), and detecting what object the character is looking at cannot be addressed by previous saliency and gaze-following models. Solving this problem requires analyzing gaze, making use of semantic knowledge about the typical 3D relationships between different views, and recognizing the objects that are common targets of attention, just as we do when watching a movie. Here we study the problem of gaze following in video, where the object attended to by a character might appear only in a separate frame. Given a video (b) around the frame containing the character (t = 0), our system selects the frames likely to contain the object attended to by the selected character (c) and produces the output shown in (d). This figure shows an actual result from our system.

    Abstract

Following the gaze of people inside videos is an important signal for understanding people and their actions. In this paper, we present an approach for following gaze in video by predicting where a person (in the video) is looking even when the object is in a different frame. We collect VideoGaze, a new dataset which we use as a benchmark to both train and evaluate models. Given one frame with a person in it, our model estimates a density for the gaze location in every frame and the probability that the person is looking in that particular frame. A key aspect of our approach is an end-to-end model that jointly estimates saliency, gaze pose, and the geometric relationships between views while only using gaze as supervision. Visualizations suggest that the model learns to internally solve these intermediate tasks without additional supervision. Experiments show that our approach follows gaze in video better than existing approaches, enabling a richer understanding of human activities in video.

    1. Introduction

Can you tell where Tom Hanks (in Fig. 1(a)) is looking? You might observe that there is not enough information in the frame to predict the location of his gaze. However, if we search the neighboring frames of the given video (shown in Fig. 1(b)), we can identify that he is looking at the woman (illustrated in Fig. 1(d)). In this paper, we introduce the problem of gaze following in video. Specifically, given a video frame with a person, and a set of neighboring frames from the same video, our goal is to identify which of the neighboring frames (if any) contain the object being looked at, and the location on that object that is being gazed upon.

Importantly, we observe that this task requires both a semantic and geometric understanding of the video. For example, semantic understanding is required to identify frames that are from the same scene (e.g., indoor and outdoor frames are unlikely to be from the same scene), while geometric understanding is required to localize exactly where the person is looking in a novel frame using the head pose and the geometric relationship between the frames. Based on this observation, we propose a novel convolutional neural network based model that combines semantic and geometric understanding of frames to follow an individual's gaze in a video. Despite encapsulating the structure of the problem, our model requires minimal supervision and produces an interpretable representation of the problem.

In order to train and evaluate our model, we collect a large-scale dataset for gaze following in videos.

• Figure 2: VideoGaze Dataset: We present a novel large-scale dataset for gaze following in video. Every person annotated in the dataset has their gaze annotated in five neighboring frames. We show some annotated examples from the dataset. In red, frames that do not contain the gazed object. In green, we show the gaze annotations from the dataset.

Our dataset consists of around 50,000 people in short videos, annotated with where they are looking throughout the video. We evaluate the performance of a variety of baseline approaches (e.g., saliency, gaze prediction in images) on our dataset, and show that our model outperforms all existing approaches.

There are three main contributions of this paper. First, we introduce the problem of following gaze in videos. Second, we collect a large-scale dataset for both training and evaluation on this task. Third, we present a novel network architecture that leverages the geometry of the scene to tackle this problem. The remainder of this paper details these contributions. In Section 2 we explore related work. In Section 3 we describe our dataset, VideoGaze. In Section 4 we describe the model in detail, and finally in Section 5 we evaluate the model and provide sample results.

    2. Related Work

We describe related work in the areas of gaze following in videos and images, deep learning for geometry prediction, and saliency below.

Gaze-following in video: Previous work on video gaze following deals with very restricted settings. Most notably, [21, 20] tackle the problem of detecting people looking at each other in video, using their head pose and location inside the frame. Although our model can be used for this goal, it is applicable to a wider variety of settings: it can predict gaze when the target is located elsewhere in the image (not only on humans) or in a future/past frame of the video. Mukherjee and Robertson [22] use RGB-D images to predict gaze in images and videos. They estimate the head pose of the person using multi-modal RGB-D data, and then regress the gaze location with a second system. Although the output of their system is also a gaze location, our model does not need multi-modal data and is able to deal with gaze locations in a different view. Extensive work has been done on human interaction and social prediction in both images and video involving gaze [33, 13, 4]. Some of this work focuses on egocentric camera data, such as [9, 8]. Furthermore, [24, 30] predict social saliency, that is, the region that attracts the attention of a group of people in the image. Finally, [4] estimates the 3D location and pose of people, which is used to predict social interaction. Although their goal is completely different, we also model the scene with explicit 3D and use it to predict gaze.

Gaze-following in images: Our model is inspired by a previous gaze-following model for static images [26]. However, that work focuses only on cases where a person within the image is looking at another object in the same image. In this work, we remove this restriction and extend gaze following to video. The model proposed in this paper deals with the situation where the person is looking at another frame of the video. Further, unlike [26], we use parametrized geometric transformations that help the model deal with the underlying geometry of the world. There have also been recent works applying deep learning to eye tracking [16, 35] that predict where an individual is looking on a device. Furthermore, [32] introduces an eye-tracking technique that avoids the calibration process. Finally, our work is also related to [5], which predicts the object of interaction in images.

Deep Learning with Geometry: Neural networks have previously been used to model geometric transformations [11, 12]. Our work is also related to Spatial Transformer Networks [14], where a localization module generates the parameters of an affine transformation and warps the representation with bilinear interpolation. Our model generates the parameters of a 3D affine transformation, but the transformation is applied analytically without warping, which is likely to be more stable. [28, 6] use 2D images to learn the underlying 3D structure. Similarly, we expect our model to learn the 3D structure of the frame composition using only 2D images. Finally, [10] provides efficient implementations for adding geometric transformations to CNNs.

  • Figure 3: Network Architecture: Our model has three pathways. The saliency pathway (top left) finds salient spots on the target view. The gaze pathway (bottom left) computes the parameters of the cone coming out of the person's face. The transformation pathway (right) estimates the geometric relationship between views. The output is the gaze location density and the probability of x_t containing the gazed object.

Saliency: Although related, gaze following and free-viewing saliency refer to different problems. In gaze following, we predict the location of the gaze of an observer in the scene, while in saliency we predict the fixations of an external observer free-viewing the image. Some authors have used gaze to improve saliency prediction, such as [25]. Furthermore, [2] showed how gaze prediction can improve state-of-the-art saliency models. Although our approach is not intended to solve video saliency, we believe it is worth mentioning some works on learning saliency for videos, such as [18, 34, 19].

    3. VideoGaze Dataset

We introduce VideoGaze, a large-scale dataset containing the locations where film characters are looking in movies. VideoGaze contains 166,721 annotations from 140 movies. To build the dataset we used videos from the MovieQA dataset [31], which we consider a representative selection of movies. Each sample of the dataset consists of six frames. The first frame contains the character whose gaze is annotated; the eye location and a head bounding box for the character are provided. The other five frames contain the location the character is looking at, if it is present in the frame. Figure 2 shows three samples from the dataset. In the left column we show the frame with the character; the other five frames are shown on the right with the gaze annotation, if available (green).
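To make the sample structure concrete, the listing below sketches how one annotation could be represented programmatically; the field names and file paths are illustrative, not the dataset's actual schema.

```python
# Illustrative layout of one VideoGaze sample (field names and paths are hypothetical).
# Coordinates are assumed to be normalized to [0, 1] relative to the frame size.
sample = {
    "movie": "example_movie",
    "source_frame": "clips/example/frame_000123.jpg",   # frame containing the character
    "eyes": (0.42, 0.31),                                # eye location in the source frame
    "head_bbox": (0.35, 0.20, 0.55, 0.45),               # (x_min, y_min, x_max, y_max)
    "targets": [                                         # five neighboring frames
        {"frame": "clips/example/frame_000098.jpg", "gaze": None},          # object not visible
        {"frame": "clips/example/frame_000110.jpg", "gaze": (0.71, 0.48)},  # annotated gaze point
        {"frame": "clips/example/frame_000123.jpg", "gaze": None},
        {"frame": "clips/example/frame_000135.jpg", "gaze": (0.69, 0.50)},
        {"frame": "clips/example/frame_000148.jpg", "gaze": (0.68, 0.51)},
    ],
}
```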

To annotate the dataset, we used Amazon's Mechanical Turk (AMT). We annotated our dataset in two separate steps. In the first step, the workers were asked to first locate the head of the character and then scan through the video to find the location of the object the character is looking at. For cost-efficiency reasons, we restricted the workers to scanning only a 6-second temporal window around the frame with the character; in pilot experiments, we found this window to be sufficient. We also provided options to indicate that the gazed object never appears in the clip or that the head of the character was not visible in the scene. In the second step, we temporally sampled four additional frames near the first annotated frame and asked the workers to annotate the gazed object if present. This two-step process ensures that if the gazed object appears in the video, it is annotated in VideoGaze.

We split our data into a training set and a test set. We use all the annotations from 20 movies as the test set and the rest of the annotations as the training set. Note that we made the train/test split by source movie, not by clip, which prevents overfitting to particular movies. Additionally, we annotated one frame per sample in the test set five times; we use this data to perform a robust evaluation of our methods and to compute human performance. Finally, for the same frames, we also annotated the similarity between the frame with the character and the frame with the object. In Figure 8 we use the similarity annotations to evaluate performance at different levels of similarity.

    4. Method

Suppose we have a video and a person inside the video. Our goal is to predict where the person is looking, which may possibly be in another frame of the video.

  • Figure 4: Transformation and intersection: The cone pathway computes the cone parameters v and α, and the transformation pathway estimates the geometric relation between the original view and the target view. The cone origin is u_e, and x_h is indicated with the blue bounding box.

Let x_s be a source frame where the person is located, x_h be an image crop containing only the person's head, and u_e be the coordinates of the eyes of the person within the frame x_s. Let x be a set of frames in which we want to predict where the person is looking (if any). We wish to both select a target frame x_t ∈ x in which the object of gaze appears and then predict the coordinates of the person's gaze ŷ in x_t. We first explain how to predict ŷ given x_t. Then, we discuss how to learn to select x_t.

    4.1. Multi-Frame Gaze Network

Suppose we are given x_t. We can design a convolutional neural network F(x_s, x_h, u_e, x_t) to predict the spatial location ŷ. While we could simply concatenate these inputs and train a network, the internal representation would be difficult to interpret and may require large amounts of training data to discover consistent patterns, which is inefficient. Instead, we seek to take advantage of the geometry of the scene to better predict people's gaze.

To follow gaze across frames, the network must be able to solve three sub-problems: (1) estimate the head pose of the person, (2) find the geometric relationship between the frame where the person is and the frame where the gaze location might be, and (3) find the potential locations in the target frame where the person might be looking (salient spots). We design a single model that internally solves each of these sub-problems even though we supervise the network only with the gaze annotations.

With this structure in mind, we design a convolutional network F to predict ŷ for a target frame x_t:

F(x_s, x_h, u_e, x_t) = S(x_t) ⊙ G(u_e, x_s, x_t)    (1)

where S(·) and G(·) are decompositions of the original problem. Both S(·) and G(·) produce a positive matrix in ℝ^{k×k}, with k being the size of the spatial maps, and ⊙ is the element-wise product. Although we only supervise F(·), our intention is that S(·) will learn to detect salient objects and G(·) will learn to estimate a mask of all the locations where the person could be looking in x_t. We use the element-wise product as an "and" operation so that the network predicts that people are looking at salient objects that are within their eyesight.
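As a minimal sketch of Equation 1 (assuming both pathways already produce k×k maps; the pathway internals are described in Sec. 4.5):

```python
import torch

def combine(saliency_map: torch.Tensor, gaze_mask: torch.Tensor) -> torch.Tensor:
    """Equation 1: F = S(x_t) ⊙ G(u_e, x_s, x_t), an element-wise product.

    Both inputs are positive (k, k) maps, so the product acts as a soft "and":
    only locations that are both salient and inside the gaze region score high.
    """
    return saliency_map * gaze_mask

# Stand-in maps with k = 13, as used in the paper.
k = 13
S = torch.rand(k, k)   # placeholder for the saliency pathway output
G = torch.rand(k, k)   # placeholder for the cone/transformation pathway output
print(combine(S, G).shape)  # torch.Size([13, 13])
```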

S is parametrized as a neural network. The structure of G is motivated by the geometry of the scene: we write G as the intersection of the person's gaze cone with a plane representing the target frame x_t, transformed into the same coordinate frame as x_s:

G(u_e, x_s, x_t) = C(u_e, x_h) ∩ τ(T(x_s, x_t))    (2)

where C(u_e, x_h) ∈ ℝ⁷ estimates the parameters of a cone representing the person's gaze in the original image x_s, T(x_s, x_t) ∈ ℝ^{3×4} estimates the parameters of an affine transformation of the target frame, and τ applies the transformation. τ is expected to compute the coordinates of x_t in the system of coordinates defined by x_s. We illustrate this process in Figure 4.

4.2. Transformation τ

We use an affine transformation to geometrically relate the two frames x_s and x_t. Let Z be the set of coordinates inside the square with corners (±1, ±1, 0), and suppose the image x_s is located in Z (x_s is resized so that its corners lie at (±1, ±1, 0)). Then:

τ(T) = Tz,  ∀z ∈ Z    (3)

The affine transformation T computes the geometric relation between the two frames. We use a CNN to compute the parameters of T, and we use T to transform the coordinates of x_t into the coordinate system defined by x_s.

In practice, we found it useful to output an additional scalar parameter γ(x_t, x_s) and to define τ(T) = γ(x_t, x_s)Tz. The network is expected to use γ to set G = 0 when no valid transformation can be found.
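The following sketch illustrates how τ can be applied analytically to a set of points rather than by warping pixels; the symbol names (including the confidence scalar, written here as gamma) follow the reconstruction above and are not the authors' code.

```python
import torch

def apply_tau(T: torch.Tensor, gamma: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
    """Apply tau(T) = gamma * T z to points z lying on the target-frame plane.

    T:      (3, 4) affine transformation predicted by the transformation pathway.
    gamma:  scalar confidence; expected to be near 0 when no valid transformation exists.
    coords: (N, 3) points inside the square Z with corners (+-1, +-1, 0).
    Returns (N, 3) points expressed in the source-frame coordinate system.
    """
    homogeneous = torch.cat([coords, torch.ones(coords.shape[0], 1)], dim=1)  # (N, 4)
    return gamma * homogeneous @ T.t()                                        # (N, 3)

# Example: transform the four corners of the target frame x_t.
corners = torch.tensor([[-1., -1., 0.], [1., -1., 0.], [1., 1., 0.], [-1., 1., 0.]])
T = torch.eye(3, 4)          # stand-in prediction: identity rotation, no translation
gamma = torch.tensor(1.0)    # stand-in confidence
print(apply_tau(T, gamma, corners))
```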

    4.3. Cone-Plane Intersection

Given a cone parametrization of the gaze direction C and a transformed frame plane τ(T), we wish to find the intersection C ∩ τ(T). The intersection is obtained by solving the following equation for β:

βᵀ Σ β = 0,  where β = (β₁, β₂, 1)    (4)

where (β₁, β₂) are coordinates in the coordinate system defined by x_t, and Σ ∈ ℝ^{3×3} is a matrix defining the cone-plane intersection as in [3]. Solving Equation 4 for all β gives the cone-plane intersection; however, it is not discrete, which would not provide a gradient for learning. Therefore, we use an approximation to make the intersection soft:

C(u_e, x_h) ∩ τ(T(x_s, x_t)) = σ(βᵀ Σ β)    (5)

where σ is a sigmoid activation function. To compute the intersection, we evaluate Equation 5 for β₁, β₂ ∈ [−1, 1].
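A sketch of the soft intersection of Equation 5: build a grid of (β₁, β₂) over [−1, 1]², evaluate the quadratic form βᵀΣβ at each grid location, and pass it through a sigmoid. The construction of Σ from the cone parameters (following [3]) is omitted; a random symmetric matrix stands in for it here.

```python
import torch

def soft_cone_plane_intersection(Sigma: torch.Tensor, k: int = 13) -> torch.Tensor:
    """Evaluate sigmoid(beta^T Sigma beta) on a k x k grid of beta in [-1, 1]^2.

    Sigma: (3, 3) matrix encoding the cone-plane intersection in the
           target-frame coordinate system (its construction is omitted here).
    Returns a (k, k) soft intersection mask.
    """
    lin = torch.linspace(-1.0, 1.0, k)
    b1, b2 = torch.meshgrid(lin, lin, indexing="ij")
    beta = torch.stack([b1, b2, torch.ones_like(b1)], dim=-1)    # (k, k, 3)
    quad = torch.einsum("ijm,mn,ijn->ij", beta, Sigma, beta)     # beta^T Sigma beta
    return torch.sigmoid(quad)                                   # (k, k)

# Example with a stand-in Sigma (any symmetric matrix works for the illustration).
A = torch.randn(3, 3)
Sigma = 0.5 * (A + A.t())
print(soft_cone_plane_intersection(Sigma).shape)  # torch.Size([13, 13])
```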

    4.4. Frame Selection

We described an approach to predict the spatial location ŷ where a person is looking inside a given frame x_t. However, how should we pick the target frame x_t? To do this, we simultaneously estimate the probability that the person of interest is looking inside a frame x_t. Let E(S(x_t), G(u_e, x_s, x_t)) be this probability, where E is a neural network.
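A minimal sketch of such a selector: a small network scores each candidate frame from the two spatial maps. Flattening and concatenating the maps before the MLP is an illustrative choice, not a detail specified in the text.

```python
import torch
import torch.nn as nn

class FrameSelector(nn.Module):
    """Scores the probability that the gazed object lies in a candidate frame x_t.

    Takes the saliency map S(x_t) and the projected gaze mask G(u_e, x_s, x_t);
    flattening + concatenation before the MLP is an illustrative design choice.
    """
    def __init__(self, k: int = 13, hidden: int = 200):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * k * k, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, S: torch.Tensor, G: torch.Tensor) -> torch.Tensor:
        feats = torch.cat([S.flatten(1), G.flatten(1)], dim=1)  # (batch, 2*k*k)
        return torch.sigmoid(self.mlp(feats)).squeeze(1)        # probability per frame

# Score five candidate frames for one person (stand-in maps).
selector = FrameSelector()
S = torch.rand(5, 13, 13)
G = torch.rand(5, 13, 13)
print(selector(S, G))  # five probabilities; the argmax is picked as the target frame
```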

    4.5. Pathways

We estimate the parameters of the saliency map S, the cone C, and the transformation T using CNNs.

Saliency Pathway: The saliency pathway uses the target frame x_t to generate a spatial map S(x_t). We use a 6-layer CNN to generate the spatial map from the input image. The five initial convolutional layers follow the structure of AlexNet introduced in [17]. The last layer uses a 1×1 kernel to merge the 256 channels into a single k×k map.

Cone Pathway: The cone pathway generates a cone parametrization from a close-up image of the head x_h and the eye coordinates u_e. We set the origin of the cone at the head of the person, u_e, and let a CNN generate v ∈ ℝ³, the direction of the cone, and α ∈ ℝ, its aperture. Figure 4 shows a schematic example of the cone generation.
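A sketch of the saliency pathway's shape, using torchvision's AlexNet trunk (minus its final pooling layer) as a stand-in for the five convolutional layers of [17], followed by the 1×1 merge convolution; the ReLU that keeps the map positive is an assumption.

```python
import torch
import torch.nn as nn
from torchvision.models import alexnet

class SaliencyPathway(nn.Module):
    """AlexNet-style conv trunk + 1x1 conv that merges 256 channels into one k x k map."""
    def __init__(self):
        super().__init__()
        trunk = alexnet().features
        self.conv = trunk[:-1]                    # drop the last max-pool: 256 x 13 x 13 output
        self.merge = nn.Conv2d(256, 1, kernel_size=1)

    def forward(self, x_t: torch.Tensor) -> torch.Tensor:
        feats = self.conv(x_t)                    # (batch, 256, 13, 13) for 224 x 224 input
        return torch.relu(self.merge(feats)).squeeze(1)   # positive (batch, 13, 13) map

S = SaliencyPathway()
print(S(torch.randn(1, 3, 224, 224)).shape)       # torch.Size([1, 13, 13])
```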

Transformation Pathway: The transformation pathway has two stages. We define T₁, a 5-layer CNN following the structure defined in [17]; T₁ is applied separately to both the source frame x_s and the target frame x_t. We define T₂, which is composed of one convolutional layer and three fully connected layers that reduce the dimensionality of the representation. The output of the pathway is computed as T(x_s, x_t) = T₂(T₁(x_s), T₁(x_t)). We use [10] to compute the transformation matrix from the output parameters.
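A sketch of the two-stage transformation pathway with a shared trunk T₁ and a head T₂ whose layer sizes follow Sec. 4.8; how the two feature maps are combined (here, channel-wise concatenation) and the number of output parameters are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import alexnet

class TransformationPathway(nn.Module):
    """T(x_s, x_t) = T2(T1(x_s), T1(x_t)): shared conv trunk, then a small head."""
    def __init__(self, n_params: int = 4):        # e.g. 3 rotation angles + confidence (illustrative)
        super().__init__()
        self.t1 = alexnet().features[:-1]         # shared trunk, 256 x 13 x 13 per frame
        self.t2 = nn.Sequential(
            nn.Conv2d(512, 100, kernel_size=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(100 * 6 * 6, 200), nn.ReLU(),
            nn.Linear(200, 100), nn.ReLU(),
            nn.Linear(100, n_params),
        )

    def forward(self, x_s: torch.Tensor, x_t: torch.Tensor) -> torch.Tensor:
        feats = torch.cat([self.t1(x_s), self.t1(x_t)], dim=1)   # (batch, 512, 13, 13)
        return self.t2(feats)                                    # transformation parameters

tp = TransformationPathway()
print(tp(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224)).shape)  # (1, 4)
```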

Discussion: We constrain each pathway to learn different aspects of the problem by providing each pathway only a subset of the inputs. The saliency pathway only has access to the target frame x_t, which is insufficient to solve the full problem; instead, we expect it to find salient objects in the target view x_t. Likewise, the transformation pathway has access to both x_s and x_t, and the transformation is later used to project the gaze cone, so we expect it to compute a transformation that geometrically relates x_s and x_t. We expect each of the pathways to solve its particular sub-problem, and their outputs are then combined to generate the final output. Since every step is differentiable, the model can be trained end-to-end without intermediate supervision.

    4.6. Learning

Since gaze following is a multi-modal problem, we train F to estimate a spatial probability distribution q(x, y) instead of regressing a single gaze location. We use a generalization of the spatial loss used in [26], which uses five different shifted classification grids and combines their predictions. We generalize this loss by averaging over all possible grids of different shifts and sizes:

L(p, q) = Σ_{w,h,δ_x,δ_y} E_{w,h,δ_x,δ_y}(p, q)    (6)

where E_{w,h,δ_x,δ_y} is a spatially smoothed cross entropy with grid cells of size w × h shifted by (δ_x, δ_y). Instead of using q directly to compute the loss, E uses a smoothed version of q in which, for each position (x, y), the probability is summed over the surrounding rectangle. For simplicity, we write this in one dimension:

E_{w,δ_x} = − Σ_x p(x) log Σ_{δ=0}^{w} q(x + δ_x + δ)    (7)

which is similar to the cross-entropy loss except that the spatial bins are shifted by δ_x and scaled by w. This expression can be written as the output of a convolution, which is efficient to compute and differentiable.
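The one-dimensional sketch below illustrates the point that each term of Equation 7 can be computed as a convolution: summing q over a window of size w starting at x + δ_x is a box-filter correlation, after which the usual cross entropy with p is taken. Boundary handling and the exact window convention are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def smoothed_ce_1d(p: torch.Tensor, q: torch.Tensor, w: int, shift: int, eps: float = 1e-8) -> torch.Tensor:
    """One term E_{w, delta_x} of Eq. 7 in one dimension (shift >= 0 assumed).

    p: target distribution over n positions; q: predicted distribution over n positions.
    The inner sum of q over a window of size w starting at x + shift is computed as a
    box-filter correlation, which keeps the whole term differentiable and cheap.
    """
    n = p.numel()
    kernel = torch.ones(1, 1, w)
    padded = F.pad(q.view(1, 1, -1), (0, w - 1 + shift))       # zeros beyond the last bin
    windowed = F.conv1d(padded, kernel).view(-1)               # windowed[j] = sum(q[j : j + w])
    return -(p * torch.log(windowed[shift:shift + n] + eps)).sum()

# Tiny example over 20 spatial bins: a peaked target and a smooth prediction.
p = torch.zeros(20); p[7] = 1.0
q = torch.softmax(torch.randn(20), dim=0)
loss = sum(smoothed_ce_1d(p, q, w, s) for w in (1, 2, 4) for s in (0, 1))
print(loss)
```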

    4.7. Inference

Our network F produces a matrix A ∈ ℝ^{20×20}, a map that can be interpreted as a density over where the person is looking. To infer the gaze location ŷ in the target frame x_t, we find the mode of this density, ŷ = argmax_{i,j} A_{ij}. To select the target frame x_t, we pick the frame with the highest score from E.
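A sketch of this two-step inference: pick the candidate frame with the highest selector score, then take the mode of its 20×20 density map.

```python
import torch

def infer_gaze(A: torch.Tensor) -> tuple:
    """Return the (row, col) mode of a gaze density map A, e.g. of shape (20, 20)."""
    idx = int(torch.argmax(A))          # index into the flattened map
    return divmod(idx, A.shape[1])      # convert back to 2D coordinates

# Stand-in outputs for five candidate frames.
frame_scores = torch.tensor([0.10, 0.72, 0.41, 0.08, 0.33])  # E(.) per candidate frame
density_maps = torch.rand(5, 20, 20)                          # F(.) per candidate frame
best = int(torch.argmax(frame_scores))                        # selected target frame
print(best, infer_gaze(density_maps[best]))
```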

    4.8. Implementation Details

We implemented our model using PyTorch. In our experiments we use k = 13: the output of both the saliency pathway and the cone generator is a 13×13 spatial map. We found it useful to add a final fully connected layer to upscale the 13×13 spatial map to a 20×20 spatial map. We initialize the CNNs in the three pathways with ImageNet-CNN [17, 29]. The cone pathway has three fully connected layers of sizes 500, 200 and 4 to generate the cone parametrization. The common part of the transformation pathway, T₂, has one convolutional layer with a 1×1 kernel and 100 output channels, followed by one 2×2 max pooling layer and three fully connected layers of sizes 200, 100, and the parameter size of the transformation. E is a multilayer perceptron with one hidden layer of 200 units. For training, we augment the data by flipping x_t and x_s and their annotations.

    5. Experiments

    5.1. Evaluation Procedure

To evaluate our model we conducted quantitative and qualitative analyses using our held-out dataset. We use four ground-truth annotations for evaluation and one to evaluate human performance. Similar to [7], for quantitative evaluation we provide bounding boxes for the heads of the people; the bounding boxes are part of the dataset and were collected using Amazon's Mechanical Turk. This keeps the evaluation focused on the gaze-following task. In Figures 5 and 7 we provide qualitative examples of our system working with head bounding boxes computed by an automatic head detector. For our quantitative evaluation, we report the performance of the model on two tasks: predicting the gaze location given the frame with the object, and selecting the frame with the object.

(a) Baselines
Model              Dist    Min.Dist  AUC   KL
Static Gaze [26]   0.287   0.233     76.5  9.03
Saliency           0.253   0.206     85.0  8.49
Fixed bias         0.281   0.226     71.0  22.79
Center             0.236   0.198     76.3  18.64
Random             0.437   0.380     56.9  28.39
Ours               0.184   0.123     89.0  7.76
Human              0.103   0.063     90.1  10.59

(b) Ablation Analysis
Model              Dist    Min.Dist  AUC   KL
Cone Only          0.194   0.139     83.8  8.52
Image Only         0.236   0.175     87.7  7.90
Identity           0.201   0.141     86.6  8.04
Translation Only   0.194   0.133     87.9  7.81
Rotation Only      0.195   0.134     87.5  7.95
Vertical axis rot  0.189   0.128     88.5  7.82
3-axis rot (Ours)  0.184   0.123     89.0  7.76

(c) Frame selection
Model              AP
Random             75.1
Closest            75.7
Saliency           76.0
Image only         83.9
Cone only          86.7
Vertical axis rot  87.1
Ours               87.5

Table 1: Evaluation: In (a) we compare our performance with the baselines. In (b) we analyze the performance of the different ablations of our model. In (c) we analyze the ability of the model to select the target frame, compared against baselines and ablations. AUC stands for Area Under the Curve and is computed as the area under the ROC curve. Dist is the L2 distance to the ground-truth location. Min.Dist is the minimum L2 distance to any ground-truth annotation. KL refers to the Kullback-Leibler divergence. AP stands for Average Precision and is defined as the area under the precision-recall curve. Higher is better for AUC and AP; lower is better for KL and the L2 distances.

    5.1.1 Predicting gaze location

We use AUC, L2 distances, and KL divergence as our evaluation metrics for predicting the gaze location. AUC refers to Area Under the Curve, a measure typically used to compare predicted distributions to samples; the predicted heat map is used as a confidence map to build an ROC curve, and we use [15] to compute the AUC metric. We also use the L2 metric, computed as the Euclidean error between the predicted point and the ground-truth annotation. Additionally, we report the minimum distance to a human annotation, which is the L2 distance to the closest ground-truth point. For comparison purposes, we assume the images are normalized to have sides of length 1. Finally, KL refers to the Kullback-Leibler divergence, a measure of the information lost when the output map is used as the gaze fixation map; KL is typically used to compare distributions [1].
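For the distance-based metrics, a small sketch (assuming coordinates normalized to [0, 1]); the AUC computation follows [15] and is not reproduced here.

```python
import numpy as np

def l2_error(pred, gt):
    """Euclidean distance between the predicted point and a ground-truth point."""
    return float(np.linalg.norm(np.asarray(pred) - np.asarray(gt)))

def min_dist(pred, gt_points):
    """Minimum L2 distance from the prediction to any ground-truth annotation."""
    return min(l2_error(pred, g) for g in gt_points)

def kl_divergence(p_map, q_map, eps=1e-8):
    """KL divergence between two spatial maps, each normalized to sum to 1."""
    p = np.asarray(p_map, dtype=float) + eps
    q = np.asarray(q_map, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

print(l2_error((0.50, 0.50), (0.62, 0.41)))
print(min_dist((0.50, 0.50), [(0.62, 0.41), (0.52, 0.49), (0.70, 0.30), (0.55, 0.52)]))
print(kl_divergence(np.random.rand(20, 20), np.random.rand(20, 20)))
```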

Previous work on gaze following in video cannot be evaluated on our benchmark because of its particular constraints (it only predicts social interaction or requires multi-modal data). We compare our method to several baselines described below. For methods producing a single location, we use a Gaussian distribution centered at the output location.

Random: The prediction is a random location in the image. Center: The prediction is always the center of the image. Fixed bias: The head location is quantized on a 13×13 grid and the training set is used to compute the average output location for each head location. Saliency: The output heatmap is the saliency prediction for x_t, computed with [23]; the output point is the mode of the saliency output distribution. Static Gaze: [26] is used to compute the gaze prediction. Since it is a method for static images, the head image and the head location provided are from the source view, but the image provided is the target view.

Additionally, we performed an analysis of the components of our model. With this analysis, we aim to understand the contribution of each of the parts to performance, as well as to suggest that all of them are needed.

Translation only: The affine transformation is a translation. Rotation only: The affine transformation is a 3-axis rotation. Identity: The affine transformation is the identity. Image only: The saliency pathway alone is used to generate the output. Cone only: The gaze pathway combined with the transformation pathway is used to generate the output. 3-axis rotation / translation: The affine transformation is a 3-axis rotation combined with a translation. Vertical axis rotation: The affine transformation is a rotation about the vertical axis combined with a translation.

    5.1.2 Frame selection

We use mean Average Precision (AP) as our evaluation metric for frame selection. AP is defined as the area under the precision-recall curve and has been extensively used to evaluate detection problems. As with predicting the gaze location, previous work in gaze following is not applicable to the frame selection task. We compare our method to the baselines described below.

Random: The score for the frame is randomly assigned. Closest: The score is inversely related to the time difference between the source frame and the target frame. Saliency: The score assigned to the frame is inversely related to the entropy of the saliency map [23]; this value is higher if the saliency map is more concentrated, which could indicate the presence of a salient object. Additionally, we compare against some of the ablation models defined in the previous section.

  • Figure 5: Full Results: We show two detailed examples of how our model works. In (a) and (c), we show the probability that our network assigns to every frame in the video. Once the frame is selected, in (b) and (d) we show the final gaze prediction of our network.

Figure 6: Internal visualizations: We show examples of the outputs of the different pathways of our network: from left to right, the original frame, the target frame, the cone projection, the saliency map, and the final output. The cone projection shows the output of the cone-plane intersection module, the saliency map shows the output of the saliency pathway, and the final output shows the predicted gaze location distribution.

5.2. Results

    Table 1 summarizes our performance on both tasks.

    5.2.1 Predicting gaze location

Our model achieves 89.0 AUC, 0.184 L2 distance, 0.123 minimum L2 distance, and 7.76 KL. Our performance is significantly better than all the baselines. Interestingly, the model with vertical-axis rotation performs similarly (88.5 / 0.189 / 0.128 / 7.82), which we attribute to the fact that most camera rotations are about the vertical axis.

Our analysis shows that our model outperforms all possible combinations of sub-models and restricted transformations, and that each component of the model is required to obtain good performance. Note that models generating a single location perform worse in KL divergence because the metric is designed to evaluate distributions.

In Figure 6 we show the output of the internal pathways of our model. This figure suggests that our network has internally learned to solve the sub-problems we intended it to solve. In addition to solving the overall gaze-following problem, the network is able to estimate the geometric relationship between frames, estimate the gaze direction from the source view, and predict the salient regions in the target view.

  • Figure 7: Following a character: We follow a character through a movie and list the objects he has seen during the film. Here we present three examples of our predictions (original frame, selected frame, and detection), including both good predictions and failures.

    5.2.2 Frame selection

The mean AP of our model is 87.5, outperforming all the baselines and ablation models. Interestingly, the model using only the target frame performs significantly worse than the models using both source and target frames, showing that the source frame is needed to retrieve the frame of interest. In Figure 5 we show two examples of the frame selection system: on the left, we show the source frame and, on the right, five frames; below the frames we show the frame selector network score. In the first example, it clearly selects the right frame. In the second example, which is more ambiguous, it selects the right frame as well.

    5.3. Combined model

Figure 7 shows the output of our model using an automatic head detector (the Face Recognition library in Python) and using the frame selector to select the frame. Furthermore, we use [27] to detect and label the object the character is looking at. Using our model, we can list the objects that the character has seen during a movie.

Figure 5 presents two examples with our full pipeline. In Fig. 5(a) and (c) we show the frame selection score over time. As expected, the frames containing the person whose gaze is being predicted have low scores, while frames likely to contain the gazed object have higher scores. In Fig. 5(b) and (d) we plot the final prediction.

    5.4. Similarity analysis

How different is our method from a saliency model or from the single-image gaze model? One could argue that when the frames are different our system is simply performing saliency prediction, and that when the frames are similar the static method could be used instead. In Figure 8 we evaluate the performance of these models while varying the similarity between the source and the target frame, using ground-truth similarity annotations collected on AMT. We plot the performance of our method, a static gaze-following method [26], a state-of-the-art saliency method [23], and humans.

Figure 8: Similarity-performance representation: We plot performance (AUC, L2 error, and KL) against the similarity of the target and the source frame, from most similar to least similar. Our model outperforms saliency and static gaze following across the whole similarity range for all the metrics.

We outperform both static gaze following and saliency across the entire similarity range, showing that our model is doing more than simply combining these two tasks. As mentioned in Sec. 5.2, humans perform poorly according to KL because the metric is designed to compare distributions, not single locations.

    6. Conclusions

We present a novel method for gaze following in video. Given one frame with a person in it, we are able to find the frame where the person is looking and predict the gaze location, even when the frames are quite different. We split our model into three pathways that automatically learn to solve the three sub-problems involved in the task, taking advantage of the geometry of the scene to better predict people's gaze. We also introduce a new dataset on which we benchmark our model, showing that it outperforms the baselines and produces meaningful outputs. We hope that our dataset will attract the community's attention to this problem.

Acknowledgements. We thank Z. Bylinskii for proofreading. Funding for this research was partially supported by the Obra Social la Caixa Fellowship to AR and by Samsung.

    References

[1] Z. Bylinskii*, T. Judd*, A. Oliva, A. Torralba, and F. Durand. What do different evaluation metrics tell us about saliency models? arXiv preprint arXiv:1604.03605, 2016.

[2] Z. Bylinskii, A. Recasens, A. Borji, A. Oliva, A. Torralba, and F. Durand. Where should saliency models look next? In ECCV, pages 809–824. Springer, 2016.

[3] S. Calinon and A. Billard. Teaching a humanoid robot to recognize and reproduce social cues. In Proc. IEEE Intl Symposium on Robot and Human Interactive Communication (Ro-Man), pages 346–351, September 2006.

[4] I. Chakraborty, H. Cheng, and O. Javed. 3d visual proxemics: Recognizing human interactions in 3d from a single image. In CVPR, pages 3406–3413, 2013.

[5] C.-Y. Chen and K. Grauman. Subjects and their objects: Localizing interactees for a person-centric view of importance. IJCV, pages 1–22, 2016.

[6] D. DeTone, T. Malisiewicz, and A. Rabinovich. Deep image homography estimation. arXiv preprint arXiv:1606.03798, 2016.

[7] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. IJCV, 2010.

[8] A. Fathi, J. K. Hodgins, and J. M. Rehg. Social interactions: A first-person perspective. In CVPR, 2012.

[9] A. Fathi, Y. Li, and J. M. Rehg. Learning to recognize daily actions using gaze. In ECCV, 2012.

[10] A. Handa, M. Bloesch, V. Patraucean, S. Stent, J. McCormac, and A. Davison. gvnn: Neural network library for geometric computer vision. arXiv preprint arXiv:1607.07405, 2016.

[11] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44–51. Springer, 2011.

[12] G. F. Hinton. A parallel computation that assigns canonical object-based frames of reference. In Proceedings of the 7th International Joint Conference on Artificial Intelligence - Volume 2, pages 683–685. Morgan Kaufmann Publishers Inc., 1981.

[13] M. Hoai and A. Zisserman. Talking heads: Detecting humans and recognizing their interactions. In CVPR, pages 875–882, 2014.

[14] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In NIPS, pages 2017–2025, 2015.

[15] T. Judd, K. Ehinger, F. Durand, and A. Torralba. Learning to predict where humans look. In CVPR, 2009.

[16] K. Krafka, A. Khosla, P. Kellnhofer, H. Kannan, S. Bhandarkar, W. Matusik, and A. Torralba. Eye tracking for everyone. In CVPR, 2016.

[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.

[18] J. Li, Y. Tian, T. Huang, and W. Gao. A dataset and evaluation methodology for visual saliency in video. In 2009 IEEE International Conference on Multimedia and Expo, pages 442–445. IEEE, 2009.

[19] S. Li and M. Lee. Fast visual tracking using motion saliency in video. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume 1, pages I-1073. IEEE, 2007.

[20] M. J. Marín-Jiménez, A. Zisserman, M. Eichner, and V. Ferrari. Detecting people looking at each other in videos. IJCV, 106(3):282–296, 2014.

[21] M. J. Marín-Jiménez, A. Zisserman, and V. Ferrari. Here's looking at you, kid. Detecting people looking at each other in videos. In BMVC, 2011.

[22] S. S. Mukherjee and N. M. Robertson. Deep head pose: Gaze-direction estimation in multimodal video. IEEE Transactions on Multimedia, 17(11):2094–2107, 2015.

[23] J. Pan, E. Sayrol, X. Giro-i Nieto, K. McGuinness, and N. E. O'Connor. Shallow and deep convolutional networks for saliency prediction. In CVPR, June 2016.

[24] H. Park, E. Jain, and Y. Sheikh. Predicting primary gaze behavior using social saliency fields. In ICCV, 2013.

[25] D. Parks, A. Borji, and L. Itti. Augmented saliency model using automatic 3d head pose detection and learned gaze following in natural scenes. Vision Research, 2014.

[26] A. Recasens, A. Khosla, C. Vondrick, and A. Torralba. Where are they looking? In NIPS, pages 199–207, 2015.

[27] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.

[28] D. J. Rezende, S. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess. Unsupervised learning of 3d structure from images. arXiv preprint arXiv:1607.00662, 2016.

[29] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 2015.

[30] H. Soo Park and J. Shi. Social saliency prediction. In CVPR, 2015.

[31] M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler. Movieqa: Understanding stories in movies through question-answering. arXiv preprint arXiv:1512.02902, 2015.

[32] S. Tripathi and B. Guenter. A statistical approach to continuous self-calibrating eye gaze tracking for head-mounted virtual reality systems. In WACV, pages 862–870. IEEE, 2017.

[33] S. Vascon, E. Z. Mequanint, M. Cristani, H. Hung, M. Pelillo, and V. Murino. A game-theoretic probabilistic approach for detecting conversational groups. In Asian Conference on Computer Vision, pages 658–675. Springer, 2014.

[34] Y. Xia, R. Hu, Z. Huang, and Y. Su. A novel method for generation of motion saliency. In 2010 IEEE International Conference on Image Processing, pages 4685–4688. IEEE, 2010.

[35] X. Zhang, Y. Sugano, M. Fritz, and A. Bulling. Appearance-based gaze estimation in the wild. In CVPR, pages 4511–4520, 2015.