Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework
for Video Classification
Zuxuan Wu, Xi Wang, Yu-Gang Jiang,
Hao Ye, Xiangyang Xue
School of Computer Science, Fudan University, Shanghai, China
ACM Multimedia, Brisbane, Australia, Oct., 2015
Video Classification
• Videos are everywhere
• Wide applications: web video search, video collection management, intelligent video surveillance
Video Classification: State of the Art
1. Improved Dense Trajectories [Wang et al., ICCV 2013]
   a) Tracking trajectories
   b) Computing local descriptors along the trajectories
2. Feature Encoding [Perronnin et al., CVPR 2010; Xu et al., CVPR 2015]
   a) Encoding local features with Fisher Vectors/VLAD
   b) Normalization methods, such as the power norm
Video Classification: Deep Learning
1. Image-based CNN Classification [Zha et al., arXiv 2015]
   a) Extracting deep features for each frame
   b) Averaging frame-level deep features
2. Two-Stream CNN [Simonyan et al., NIPS 2014]
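The image-based baseline above can be sketched in a few lines; `classify_by_average_pooling` and the linear classifier are illustrative stand-ins, not the paper's implementation:

```python
import numpy as np

def classify_by_average_pooling(frame_features, classifier_weights):
    """Image-based video classification: average the frame-level CNN
    features over time, then apply a linear classifier to the result."""
    # frame_features: (num_frames, feature_dim) per-frame CNN features
    video_feature = frame_features.mean(axis=0)   # temporal average pooling
    scores = video_feature @ classifier_weights   # (num_classes,) class scores
    return int(np.argmax(scores))                 # predicted class index

# toy usage: 5 frames, 4-dim features, 3 classes (random stand-in data)
rng = np.random.default_rng(0)
features = rng.normal(size=(5, 4))
weights = rng.normal(size=(4, 3))
pred = classify_by_average_pooling(features, weights)
```

Averaging discards temporal order entirely, which is exactly the limitation the LSTM stream in this talk is meant to address.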
Video Classification: Deep Learning
3. Recurrent NN: LSTM [Ng et al., CVPR 2015]
[Figure: an unrolled LSTM applied to successive frames of a diving video (jumping from platform, rotating in the air, falling into water), emitting an output O_{t-1}, O_t, O_{t+1} at each step]
The performance is not ideal: the LSTM results are close to image-based classification with average pooling.
We propose a hybrid deep learning framework to capture appearance, short-term motion and
long-term temporal dynamics in videos.
Our Framework
[Figure: individual frames of the input video are fed to a Spatial CNN at each time step and the features are passed through an LSTM, producing per-step predictions y_s^1, ..., y_s^T; stacked optical flow is fed to a Motion CNN and an LSTM, producing y_m^1, ..., y_m^T; a regularized fusion layer (weights W_s^E and W_m^E at layer E) combines the two streams into the final prediction]
We propose a hybrid deep learning framework to model rich multimodal information:
a) Appearance and short-term motion with CNNs
b) Long-term temporal information with LSTMs
c) Regularized fusion to explore feature correlations
Spatial and Motion CNN Features
[Figure: each individual frame of the input video is fed to a Spatial Convolutional Neural Network; stacked optical flow is fed to a Motion Convolutional Neural Network; the predictions of the two streams are combined by score fusion]
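The score fusion of the two streams can be sketched as a weighted average of per-class scores; the fusion weight 0.6 below is illustrative, not the value used in the paper:

```python
import numpy as np

def fuse_scores(spatial_scores, motion_scores, w_spatial=0.5):
    """Late score fusion: a weighted average of the spatial-CNN and
    motion-CNN per-class prediction scores."""
    return w_spatial * spatial_scores + (1.0 - w_spatial) * motion_scores

spatial = np.array([0.7, 0.2, 0.1])  # per-class scores from the spatial stream
motion = np.array([0.3, 0.5, 0.2])   # per-class scores from the motion stream
fused = fuse_scores(spatial, motion, w_spatial=0.6)
# fused = [0.54, 0.32, 0.14]; the predicted class is argmax = 0
```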
Temporal Modeling with LSTM
[Figure: an unrolled recurrent neural network]
Regularized Feature Fusion
[Ngiam et al., ICML 2011; Srivastava et al., NIPS 2012]
DNN learning scheme:
- Calculate the prediction error
- Update the weights w(t) → w(t+1) with back-propagation
In these multimodal networks the fusion is performed in a free manner, without explicitly exploring the feature correlations.
Regularized Feature Fusion
Objective function (terms annotated on the slide): the empirical loss plus regularizers that prevent overfitting, model feature relationships, and provide robustness.
- Minimizing the l2,1 norm makes the weight matrix row-sparse, which models the feature relationships across modalities.
- Minimizing the l1 norm prevents incorrect feature sharing, which provides robustness.
Optimization: for the E-th (fusion) layer, proximal gradient descent.
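The objective function itself did not survive the transcript. Based on the slide's annotations (empirical loss, overfitting, l2,1 row-sparsity, l1 robustness), it plausibly has the following general form, with W the fusion-layer weights and the lambdas as trade-off weights; the exact notation is assumed, not taken from the slide:

```latex
\min_{W}\;
\underbrace{\mathcal{L}(W)}_{\text{empirical loss}}
+ \underbrace{\lambda_{1}\,\lVert W\rVert_{F}^{2}}_{\text{prevent overfitting}}
+ \underbrace{\lambda_{2}\,\lVert W\rVert_{2,1}}_{\text{model feature relationships}}
+ \underbrace{\lambda_{3}\,\lVert W\rVert_{1}}_{\text{provide robustness}}
```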
Regularized Feature Fusion
Algorithm:
1. Initialize the weights randomly
2. for epoch = 1 to K:
   ① Calculate the prediction error with feed-forward propagation.
   for l = 1 to L:
     ② Back-propagate the prediction error and update the weight matrices.
     ③ if l == E: evaluate the proximal operator.
   end for
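Step ③ can be sketched as follows for the l2,1 term: after the gradient update on the fusion-layer weights, apply row-wise soft-thresholding. The step size `eta` and regularization weight `lam` are hypothetical names, and this covers only the l2,1 part of the objective:

```python
import numpy as np

def prox_l21(W, tau):
    """Proximal operator of tau * ||W||_{2,1}: shrink each row of W
    toward zero by tau in l2 norm; rows with norm <= tau become
    exactly zero (this is what makes the matrix row-sparse)."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)             # per-row l2 norms
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return W * scale

def proximal_gradient_step(W, grad, eta, lam):
    """One proximal-gradient step: gradient descent, then the prox."""
    return prox_l21(W - eta * grad, eta * lam)

W = np.array([[3.0, 4.0],     # row norm 5.0 -> shrunk, kept
              [0.1, 0.1]])    # row norm ~0.14 -> zeroed out
W_new = prox_l21(W, 1.0)
# W_new = [[2.4, 3.2], [0.0, 0.0]]
```

Small-norm rows are eliminated entirely, so whole feature dimensions are either shared across the fused streams or dropped, rather than mixed freely.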
Experiments
Datasets:
- UCF-101: 101 action classes, 13,320 video clips from YouTube
- Columbia Consumer Videos (CCV): 20 classes, 9,317 videos from YouTube
Experiments
Temporal Modeling:
                             UCF-101   CCV
Spatial ConvNet                 80.4   75.0
Motion ConvNet                  78.3   59.1
Spatial LSTM                    83.3   43.3
Motion LSTM                     76.6   54.7
ConvNet (spatial + motion)      86.2   75.8
LSTM (spatial + motion)         86.3   61.9
ConvNet + LSTM (spatial)        84.4   77.9
ConvNet + LSTM (motion)         81.4   70.9
All Streams                     90.3   82.4
LSTMs are worse than CNNs on noisy, long videos. CNNs and LSTMs are highly complementary!
Experiments
Regularized Feature Fusion:
Regularized fusion performs better than fusion in a free manner.
                     UCF-101   CCV
Spatial SVM             78.6   74.4
Motion SVM              78.2   57.9
SVM-EF                  86.6   75.3
SVM-LF                  85.3   74.9
SVM-MKL                 86.8   75.4
NN-EF                   86.5   75.6
NN-LF                   85.1   75.2
M-DBM                   86.9   75.3
Two-Stream CNN          86.2   75.8
Regularized Fusion      88.4   76.2
Experiments
Hybrid Deep Learning Framework:
Experiments
Comparisons with the State of the Art:
CCV
Xu et al.          60.3%
Ye et al.          64.0%
Jhuo et al.        64.0%
Ma et al.          63.4%
Liu et al.         68.2%
Wu et al.          70.6%
Ours               83.5%

UCF-101
Donahue et al.     82.9%
Srivastava et al.  84.3%
Wang et al.        85.9%
Tran et al.        86.7%
Simonyan et al.    88.0%
Lan et al.         89.1%
Zha et al.         89.6%
Ours               91.3%
Conclusion
We propose a hybrid deep learning framework to model rich multimodal information:
1. Modeling appearance and short-term motion with CNNs
2. Capturing long-term temporal information with LSTMs
3. Regularized fusion to explore feature correlations
Take-home message:
1. LSTMs and CNNs are highly complementary.
2. Regularized feature fusion performs better than free fusion.