Describing Videos by Exploiting Temporal Structure
Slides by Alberto Montes. Computer Vision Group, April 12th, 2016
[arXiv] [GitXiv] [video] [code]
Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, Aaron Courville
Introduction
Goal: Generate captions from videos.
Video Description Generation Framework
Encoder-Decoder Framework
Encoder: Convolutional Neural Network
Basic approach:
Deep CNN over frames
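As a concrete sketch, frame-level feature extraction with a pretrained 2D CNN (the deck names GoogLeNet in the experimental setup; the PyTorch usage and preprocessing here are assumptions, not the paper's exact pipeline):

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pretrained 2D GoogLeNet; drop the classification head so that
# each frame maps to a 1024-d feature vector.
cnn = models.googlenet(weights="DEFAULT")
cnn.fc = torch.nn.Identity()  # remove the classifier layer
cnn.eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def encode_frames(frames):
    """frames: list of PIL images sampled from the video."""
    batch = torch.stack([preprocess(f) for f in frames])
    with torch.no_grad():
        return cnn(batch)  # (num_frames, 1024): the video features V
```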
Decoder: Long Short-Term Memory Network
Long Short-Term Memory
Forget Gate:
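A reconstruction of the forget-gate equation in the paper's notation (the gate conditions on the previous word, the previous hidden state, and the encoder context):

```latex
f_t = \sigma\!\left(W_f\,E[y_{t-1}] + U_f\,h_{t-1} + A_f\,\varphi_t(V) + b_f\right)
```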
Input Gate Layer
New candidates for cell state
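Correspondingly, the input gate and the candidate cell state (same reconstructed notation):

```latex
i_t = \sigma\!\left(W_i\,E[y_{t-1}] + U_i\,h_{t-1} + A_i\,\varphi_t(V) + b_i\right)
\tilde{c}_t = \tanh\!\left(W_c\,E[y_{t-1}] + U_c\,h_{t-1} + A_c\,\varphi_t(V) + b_c\right)
```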
Update Memory Content:
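The memory cell gates the old content against the new candidates, and the hidden state is read out through an output gate o_t (defined like the other gates):

```latex
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \tanh(c_t)
```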
Notation:
- E[y_{t-1}]: word embedding of the input word
- h_{t-1}: previous hidden state
- W, U, A: weight matrices
- φ_t(V): context from the encoder
- b: bias
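Putting the pieces together, a minimal PyTorch sketch of one decoder step (a simplified approximation: the encoder context is concatenated to the word embedding and fed to a standard LSTMCell, rather than entering each gate through a separate A matrix as in the equations above; all sizes are illustrative):

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, ctx_dim, hidden_dim = 10000, 512, 1024, 512  # illustrative

embed = nn.Embedding(vocab_size, embed_dim)          # E[y]
cell = nn.LSTMCell(embed_dim + ctx_dim, hidden_dim)  # gates over [E[y_{t-1}]; phi_t(V)]
out = nn.Linear(hidden_dim, vocab_size)              # next-word scores

def decoder_step(prev_word, context, h, c):
    """prev_word: (batch,) token ids; context: (batch, ctx_dim) = phi_t(V)."""
    x = torch.cat([embed(prev_word), context], dim=1)
    h, c = cell(x, (h, c))
    return out(h), h, c  # logits over the vocabulary, updated states
```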
Exploiting Temporal Structure
Exploiting Local Features
● 3D CNN trained for activity recognition.
● Only the convolutional layers are used.
Histograms of Oriented Gradients (HoG)
Histograms of Oriented Flow (HoF)
Motion Boundary Histogram (MbH)
A Spatio-Temporal Convolutional Neural Network
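A minimal sketch of the spatio-temporal convolution idea in PyTorch (layer sizes and the number of input channels are illustrative assumptions, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

# 3D convolutions slide over (time, height, width), so each kernel sees
# short local motion patterns as well as spatial appearance.
local_cnn = nn.Sequential(
    nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool3d(kernel_size=2),
    nn.Conv3d(64, 128, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),  # collapse to one descriptor per clip
    nn.Flatten(),
)

clip = torch.randn(1, 3, 16, 112, 112)  # (batch, channels, frames, H, W)
features = local_cnn(clip)              # (1, 128) local motion descriptor
```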
Exploiting Global Structure
Attention Mechanism
Update of attention weights:
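The temporal attention follows the standard soft-attention recipe: a relevance score per temporal feature v_i, normalized with a softmax, then a weighted sum (a reconstruction of the paper's formulation; the weight names are notational assumptions):

```latex
e_i^{(t)} = w^\top \tanh\!\left(W_a\,h_{t-1} + U_a\,v_i + b_a\right)
\alpha_i^{(t)} = \frac{\exp\!\left(e_i^{(t)}\right)}{\sum_{j=1}^{n} \exp\!\left(e_j^{(t)}\right)}
\qquad
\varphi_t(V) = \sum_{i=1}^{n} \alpha_i^{(t)} v_i
```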
Experiments
Datasets

YouTube2Text
1,970 video clips, each with multiple descriptions
Training set: 1,200 video clips
Validation set: 100 video clips
Test set: 670 video clips
DVS (Descriptive Video Service)
Videos taken from DVDs
49,000 video clips
Training set: 39,000 video clips
Validation set: 5,000 video clips
Test set: 5,000 video clips
Setup and Training
4 setups:
◉ Basic (2D GoogLeNet CNN)
◉ Local (+ 3D CNN features)
◉ Global (+ temporal attention mechanism)
◉ Local + Global
Training
- Adadelta optimizer
- Loss function: negative log-likelihood of the ground-truth caption (written out below)
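The standard maximum-likelihood caption loss, consistent with the encoder-decoder setup above:

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p\!\left(y_t \mid y_{<t}, V; \theta\right)
```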
Results
Evaluation
Conclusions
Proposed a 3D CNN to capture local, fine-grained motion information.
Proposed a temporal attention mechanism to capture global temporal structure.
The combination of both approaches achieves state-of-the-art results on YouTube2Text.
Thank you! Questions?