
Source: ai.stanford.edu/~jniebles/cvpr18-fb-poster_web.pdf

What Makes a Video a Video: Analyzing Temporal Information in Video Understanding Models and Datasets

De-An Huang¹, Vignesh Ramanathan², Dhruv Mahajan², Lorenzo Torresani², Manohar Paluri², Li Fei-Fei¹, Juan Carlos Niebles¹

¹Stanford University, ²Facebook

Motivation

➢ Videos contain much more than just the images
➢ Still missing an explicit analysis of temporal information

[Figure: (a) Original Video; (b) Video matching C3D deep features of (a)]

Approach Overview

➢ Analyze the video model trained on a dataset (fixed weights)
➢ Propose three frameworks to ablate temporal info from test video
➢ Single frame is just an image and contains no temporal information

[Figure: bar chart, axis 0 to 90, comparing Original Video and No Temporal; further labels: Generator, Selector; annotation: 6%]

[Figure: Test Video, Subsampling, Frame Selector, Selected Frame, Temporal Generator, Generated Video, C3D trained on UCF101 (Conv 1 to Conv 5)]

Class-Agnostic Temporal Generator

➢ Temporal Dist Shift: Model has not seen “static videos” in training
➢ Generate a video from the frame to bridge the distribution shift but without using any “real” temporal information
➢ Learning the Temporal Generator: The video generated from the image should be perceptually similar to the original video for the model
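One way to read "perceptually similar to the original video for the model" is as a feature-matching objective: the fixed video model is run on both the original and the generated clip, and the generator is penalized when the intermediate features differ. The sketch below assumes the features have already been extracted at several layers (the poster's diagram shows Conv 1 to Conv 5). The squared-error distance, the equal layer weights, and the name `perceptual_loss` are illustrative assumptions, not details stated on the poster.

```python
import numpy as np


def perceptual_loss(feats_original, feats_generated):
    """Sum over layers of the mean squared feature difference.

    Both arguments are lists of arrays, one array per layer of the fixed
    video model. Squared error and equal layer weights are assumptions.
    """
    if len(feats_original) != len(feats_generated):
        raise ValueError("both inputs need one feature array per layer")
    total = 0.0
    for f_orig, f_gen in zip(feats_original, feats_generated):
        f_orig = np.asarray(f_orig, dtype=float)
        f_gen = np.asarray(f_gen, dtype=float)
        # Per-layer term: mean squared difference between the two feature maps.
        total += float(np.mean((f_orig - f_gen) ** 2))
    return total
```

A loss of zero means the fixed model cannot tell the generated clip from the original at the chosen layers, which is the sense of "similar for the model" used here.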

➢ Frame selection: find the key frame that lets us recognize the action without temporal information
➢ $q(x)$: estimate of frame quality

Naïve Subsampling

[Figure: Test Video, Middle Frame, Replicate Frames, Replicated Frames, C3D trained on UCF101 (Conv 1 to Conv 5)]

[Figure: Test Video, Middle Frame, Temporal Generator, Generated Video, C3D trained on UCF101 (Conv 1 to Conv 5)]

[Figure: Input Video, Subsampling, Selected Frame, Temporal Generator, Generated Video, Video Model (C3D); labels $\ell_1$ to $\ell_5$]
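The naïve subsampling baseline (take the middle frame of the test video and replicate it to the clip length the model expects) can be sketched as follows. The array layout `(T, H, W, C)` and the function name `replicate_middle_frame` are assumptions made for illustration.

```python
import numpy as np


def replicate_middle_frame(video, num_frames=None):
    """Build a static clip by repeating the middle frame of `video`.

    video: array of shape (T, H, W, C) (assumed layout).
    num_frames: length of the output clip; defaults to T.
    """
    video = np.asarray(video)
    t = video.shape[0]
    if t == 0:
        raise ValueError("video has no frames")
    if num_frames is None:
        num_frames = t
    middle = video[t // 2]
    # Every output frame is identical, so the clip carries no motion.
    return np.repeat(middle[np.newaxis], num_frames, axis=0)
```

Feeding such a clip to a model trained on real videos is what creates the distribution shift the temporal generator is meant to bridge: the model has never seen a clip whose frames are all identical.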

Motion-Invariant Frame Selector

$$q(x_i) = \max_c s_c(x_i)$$

where $s_c(x_i)$ is the score of class $c$ for frame $x_i$.
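The selection rule (score each candidate frame by its maximum class score, then keep the best-scoring frame) can be sketched as below. The input is assumed to be a matrix of per-frame class scores produced by some classifier. That matrix and the names `frame_quality` and `select_frame` are hypothetical and not taken from the poster.

```python
import numpy as np


def frame_quality(class_scores):
    """Quality of each candidate frame: its maximum class score.

    class_scores: array of shape (num_frames, num_classes).
    """
    return np.asarray(class_scores, dtype=float).max(axis=1)


def select_frame(class_scores):
    """Return the index of the candidate frame with the highest quality."""
    return int(np.argmax(frame_quality(class_scores)))


# Three candidate frames, three classes.
scores = [[0.2, 0.5, 0.3],
          [0.1, 0.8, 0.1],
          [0.4, 0.3, 0.3]]
best = select_frame(scores)  # frame 1 has the largest maximum score (0.8)
```

Because the rule looks at each frame independently, it uses no motion information when choosing the frame.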

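[Figure: Input Video is sub-sampled into frame candidates $x_1, \ldots, x_i, \ldots, x_N$; each candidate is scored by the frame-quality estimate $q(x)$ and the argmax is selected]

➢ Oracle Key Frames (Upper Bound): select the frames that can give correct prediction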
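The oracle upper bound can be sketched as follows: a video counts as correctly classified if at least one of its candidate frames leads to the correct prediction. The inputs (one predicted label per candidate frame, plus the ground-truth label) and the function names are assumptions made for illustration.

```python
def oracle_frame(frame_predictions, true_label):
    """Index of the first candidate frame predicted correctly, or None."""
    for index, predicted in enumerate(frame_predictions):
        if predicted == true_label:
            return index
    return None


def oracle_accuracy(videos):
    """Fraction of videos with at least one correctly predicted frame.

    videos: list of (frame_predictions, true_label) pairs.
    """
    if not videos:
        raise ValueError("need at least one video")
    hits = sum(
        oracle_frame(predictions, label) is not None
        for predictions, label in videos
    )
    return hits / len(videos)
```

Since the oracle uses the ground-truth label to pick the frame, its accuracy is an upper bound on what any label-free frame selector over the same candidates could reach.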

Analysis

➢ Analyzing Motion Information
➢ 40% of UCF101 and 35% of Kinetics classes do not need motion
➢ Temporal Generator:
➢ Frame Selection:
➢ Oracle Frame Selection
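A per-class comparison of this kind could be computed as sketched below. The poster does not state the criterion behind the "do not need motion" percentages, so the rule used here is an assumption: a class is counted when its accuracy on motion-ablated input is at least its accuracy on the original video, minus a tolerance `tol`. All function names are hypothetical.

```python
import numpy as np


def per_class_accuracy(labels, preds, num_classes):
    """Accuracy for each class; NaN for classes with no test samples."""
    labels = np.asarray(labels)
    preds = np.asarray(preds)
    acc = np.full(num_classes, np.nan)
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            acc[c] = np.mean(preds[mask] == c)
    return acc


def fraction_not_needing_motion(acc_original, acc_ablated, tol=0.0):
    """Share of classes whose ablated accuracy stays within `tol` of the
    original accuracy (assumed criterion, not taken from the poster)."""
    acc_original = np.asarray(acc_original, dtype=float)
    acc_ablated = np.asarray(acc_ablated, dtype=float)
    valid = ~np.isnan(acc_original) & ~np.isnan(acc_ablated)
    return float(np.mean(acc_ablated[valid] >= acc_original[valid] - tol))


labels = [0, 0, 1, 1, 2, 2]
acc_orig = per_class_accuracy(labels, [0, 0, 1, 1, 2, 2], num_classes=3)
acc_abl = per_class_accuracy(labels, [0, 0, 1, 0, 2, 2], num_classes=3)
share = fraction_not_needing_motion(acc_orig, acc_abl)
```

In this toy example, two of the three classes keep their accuracy after ablation, so `share` is 2/3.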

[Figure: example clips. Panels: JuggleBalls (Original Vid), JuggleBalls (Temp. Gen.), PlayFlute (Original Vid), PlayFlute (Temp. Gen.)]

[Figure: class labels: Sled Dog Racing, Ice Skating, Boxing speedbag, Ski Jumping]