Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework
for Video Classification
Zuxuan Wu, Xi Wang, Yu-Gang Jiang,
Hao Ye, Xiangyang Xue
School of Computer Science, Fudan University, Shanghai, China
ACM Multimedia, Brisbane, Australia, Oct., 2015
Video Classification
• Videos are everywhere
• Wide applications: web video search, video collection management, intelligent video surveillance
Video Classification: State of the Art
1. Improved Dense Trajectories [Wang et al., ICCV 2013]
   a) Tracking trajectories
   b) Computing local descriptors along the trajectories
2. Feature Encoding [Perronnin et al., CVPR 2010; Xu et al., CVPR 2015]
   a) Encoding local features with Fisher Vectors/VLAD
   b) Normalization methods, such as the power norm
Video Classification: Deep Learning
1. Image-based CNN Classification [Zha et al., arXiv 2015]
   a) Extracting deep features for each frame
   b) Averaging frame-level deep features
2. Two-Stream CNN [Simonyan et al., NIPS 2014]
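The image-based baseline above can be sketched in a few lines; `classify_by_average_pooling` and the linear classifier are illustrative stand-ins, not the paper's implementation:

```python
import numpy as np

def classify_by_average_pooling(frame_features, classifier_weights):
    """Image-based video classification: average the frame-level CNN
    features over time, then apply a linear classifier to the result."""
    # frame_features: (num_frames, feature_dim) per-frame CNN features
    video_feature = frame_features.mean(axis=0)   # temporal average pooling
    scores = video_feature @ classifier_weights   # (num_classes,) class scores
    return int(np.argmax(scores))                 # predicted class index

# toy usage: 5 frames, 4-dim features, 3 classes (random stand-in data)
rng = np.random.default_rng(0)
features = rng.normal(size=(5, 4))
weights = rng.normal(size=(4, 3))
pred = classify_by_average_pooling(features, weights)
```

Averaging discards temporal order entirely, which is exactly the limitation the LSTM stream in this talk is meant to address.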
Video Classification: Deep Learning
3. Recurrent NN: LSTM [Ng et al., CVPR 2015]
[Figure: an unrolled LSTM applied to successive frames of a diving video (jumping from platform, rotating in the air, falling into water), emitting an output O_{t-1}, O_t, O_{t+1} at each step]
The performance is not ideal: the LSTM results are close to image-based classification with average pooling.
We propose a hybrid deep learning framework to capture appearance, short-term motion and
long-term temporal dynamics in videos.
Our Framework
[Figure: individual frames of the input video are fed to a Spatial CNN at each time step and the features are passed through an LSTM, producing per-step predictions y_s^1, ..., y_s^T; stacked optical flow is fed to a Motion CNN and an LSTM, producing y_m^1, ..., y_m^T; a regularized fusion layer (weights W_s^E and W_m^E at layer E) combines the two streams into the final prediction]
We propose a hybrid deep learning framework to model rich multimodal information:
a) Appearance and short-term motion with CNNs
b) Long-term temporal information with LSTMs
c) Regularized fusion to explore feature correlations
Spatial and Motion CNN Features
[Figure: each individual frame of the input video is fed to a Spatial Convolutional Neural Network; stacked optical flow is fed to a Motion Convolutional Neural Network; the predictions of the two streams are combined by score fusion]
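The score fusion of the two streams can be sketched as a weighted average of per-class scores; the fusion weight 0.6 below is illustrative, not the value used in the paper:

```python
import numpy as np

def fuse_scores(spatial_scores, motion_scores, w_spatial=0.5):
    """Late score fusion: a weighted average of the spatial-CNN and
    motion-CNN per-class prediction scores."""
    return w_spatial * spatial_scores + (1.0 - w_spatial) * motion_scores

spatial = np.array([0.7, 0.2, 0.1])  # per-class scores from the spatial stream
motion = np.array([0.3, 0.5, 0.2])   # per-class scores from the motion stream
fused = fuse_scores(spatial, motion, w_spatial=0.6)
# fused = [0.54, 0.32, 0.14]; the predicted class is argmax = 0
```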
Temporal Modeling with LSTM
[Figure: an unrolled recurrent neural network]
Regularized Feature Fusion
[Ngiam et al., ICML 2011; Srivastava et al., NIPS 2012]
DNN learning scheme:
- Calculate the prediction error
- Update the weights w(t) → w(t+1) with back-propagation
In these multimodal networks the fusion is performed in a free manner, without explicitly exploring the feature correlations.
Regularized Feature Fusion
Objective function (terms annotated on the slide): the empirical loss plus regularizers that prevent overfitting, model feature relationships, and provide robustness.
- Minimizing the l2,1 norm makes the weight matrix row-sparse, which models the feature relationships across modalities.
- Minimizing the l1 norm prevents incorrect feature sharing, which provides robustness.
Optimization: for the E-th (fusion) layer, proximal gradient descent.
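The objective function itself did not survive the transcript. Based on the slide's annotations (empirical loss, overfitting, l2,1 row-sparsity, l1 robustness), it plausibly has the following general form, with W the fusion-layer weights and the lambdas as trade-off weights; the exact notation is assumed, not taken from the slide:

```latex
\min_{W}\;
\underbrace{\mathcal{L}(W)}_{\text{empirical loss}}
+ \underbrace{\lambda_{1}\,\lVert W\rVert_{F}^{2}}_{\text{prevent overfitting}}
+ \underbrace{\lambda_{2}\,\lVert W\rVert_{2,1}}_{\text{model feature relationships}}
+ \underbrace{\lambda_{3}\,\lVert W\rVert_{1}}_{\text{provide robustness}}
```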
Regularized Feature Fusion
Algorithm:
1. Initialize the weights randomly
2. for epoch = 1 to K:
   ① Calculate the prediction error with feed-forward propagation.
   for l = 1 to L:
     ② Back-propagate the prediction error and update the weight matrices.
     ③ if l == E: evaluate the proximal operator.
   end for
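Step ③ can be sketched as follows for the l2,1 term: after the gradient update on the fusion-layer weights, apply row-wise soft-thresholding. The step size `eta` and regularization weight `lam` are hypothetical names, and this covers only the l2,1 part of the objective:

```python
import numpy as np

def prox_l21(W, tau):
    """Proximal operator of tau * ||W||_{2,1}: shrink each row of W
    toward zero by tau in l2 norm; rows with norm <= tau become
    exactly zero (this is what makes the matrix row-sparse)."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)             # per-row l2 norms
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return W * scale

def proximal_gradient_step(W, grad, eta, lam):
    """One proximal-gradient step: gradient descent, then the prox."""
    return prox_l21(W - eta * grad, eta * lam)

W = np.array([[3.0, 4.0],     # row norm 5.0 -> shrunk, kept
              [0.1, 0.1]])    # row norm ~0.14 -> zeroed out
W_new = prox_l21(W, 1.0)
# W_new = [[2.4, 3.2], [0.0, 0.0]]
```

Small-norm rows are eliminated entirely, so whole feature dimensions are either shared across the fused streams or dropped, rather than mixed freely.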
Experiments
Datasets:
- UCF-101: 101 action classes, 13,320 video clips from YouTube
- Columbia Consumer Videos (CCV): 20 classes, 9,317 videos from YouTube
Experiments
Temporal Modeling:
                             UCF-101   CCV
Spatial ConvNet                 80.4   75.0
Motion ConvNet                  78.3   59.1
Spatial LSTM                    83.3   43.3
Motion LSTM                     76.6   54.7
ConvNet (spatial + motion)      86.2   75.8
LSTM (spatial + motion)         86.3   61.9
ConvNet + LSTM (spatial)        84.4   77.9
ConvNet + LSTM (motion)         81.4   70.9
All Streams                     90.3   82.4
LSTMs are worse than CNNs on noisy, long videos. CNNs and LSTMs are highly complementary!
Experiments
Regularized Feature Fusion:
Regularized fusion performs better than fusion in a free manner.
                     UCF-101   CCV
Spatial SVM             78.6   74.4
Motion SVM              78.2   57.9
SVM-EF                  86.6   75.3
SVM-LF                  85.3   74.9
SVM-MKL                 86.8   75.4
NN-EF                   86.5   75.6
NN-LF                   85.1   75.2
M-DBM                   86.9   75.3
Two-Stream CNN          86.2   75.8
Regularized Fusion      88.4   76.2
Experiments
Hybrid Deep Learning Framework:
Experiments
Comparisons with the State of the Art:
CCV
Xu et al.          60.3%
Ye et al.          64.0%
Jhuo et al.        64.0%
Ma et al.          63.4%
Liu et al.         68.2%
Wu et al.          70.6%
Ours               83.5%

UCF-101
Donahue et al.     82.9%
Srivastava et al.  84.3%
Wang et al.        85.9%
Tran et al.        86.7%
Simonyan et al.    88.0%
Lan et al.         89.1%
Zha et al.         89.6%
Ours               91.3%
Conclusion
We propose a hybrid deep learning framework to model rich multimodal information:
1. Modeling appearance and short-term motion with CNNs
2. Capturing long-term temporal information with LSTMs
3. Regularized fusion to explore feature correlations
Take-home message:
1. LSTMs and CNNs are highly complementary.
2. Regularized feature fusion performs better than free fusion.