
Page 1: Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification

Zuxuan Wu, Xi Wang, Yu-Gang Jiang, Hao Ye, Xiangyang Xue

School of Computer Science, Fudan University, Shanghai, China

ACM Multimedia, Brisbane, Australia, Oct., 2015

[email protected]

Page 2: Video Classification

• Videos are everywhere

• Wide applications: web video search, video collection management, intelligent video surveillance

Page 3: Video Classification: State of the Art

1. Improved Dense Trajectories [Wang et al., ICCV 2013]

a) Tracking trajectories
b) Computing local descriptors along the trajectories

2. Feature Encoding [Perronnin et al., CVPR 2010; Xu et al., CVPR 2015]

a) Encoding local features with Fisher Vectors/VLAD
b) Normalization methods, such as power normalization
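As a concrete illustration of step 2b, here is a minimal NumPy sketch of power (signed square-root) normalization followed by L2 normalization of an encoded feature vector; the function name and the exponent alpha = 0.5 are illustrative defaults, not taken from the cited papers.

import numpy as np

def power_l2_normalize(encoded, alpha=0.5):
    """Signed power normalization followed by L2 normalization.

    encoded: 1-D Fisher Vector / VLAD encoding of a video's local descriptors.
    alpha:   power-norm exponent (0.5 gives the common signed square root).
    """
    # Power normalization dampens bursty dimensions in the encoding.
    z = np.sign(encoded) * np.abs(encoded) ** alpha
    # L2 normalization puts all videos on the unit sphere before the classifier.
    norm = np.linalg.norm(z)
    return z / norm if norm > 0 else z

# Example: normalize a random 256-dimensional encoding.
fv = np.random.randn(256)
fv_normed = power_l2_normalize(fv)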

Page 4: Video Classification: Deep Learning

1. Image-based CNN Classification [Zha et al., arXiv 2015]

a) Extracting deep features for each frame
b) Averaging frame-level deep features

2. Two-Stream CNN [Simonyan et al., NIPS 2014]
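A minimal NumPy sketch of the frame-averaging pipeline in item 1 above; the number of frames and the feature dimension are arbitrary placeholders.

import numpy as np

# frame_features: one row per sampled frame, e.g. fc-layer activations (T x 4096).
frame_features = np.random.randn(30, 4096)

# Average pooling over time collapses the video into a single fixed-length vector
# that can be fed to any standard classifier (SVM, softmax, ...).
video_feature = frame_features.mean(axis=0)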

Page 5: Video Classification: Deep Learning

3. Recurrent NN: LSTM [Ng et al., CVPR 2015]

[Figure: frames of a diving video labeled "Jumping from platform", "Rotating in the air", "Falling into water", "Diving".]

The performance is not ideal; it is about the same as image-based classification.

[Figure: an unrolled chain of LSTM units with per-step outputs O_{t-1}, O_t, O_{t+1}.]

Page 6: Video Classification: Deep Learning

3. Recurrent NN: LSTM [Ng et al., CVPR 2015]

[Figure: the same diving example ("Jumping from platform", "Rotating in the air", "Falling into water"), processed by the LSTM.]

The performance of LSTM and average pooling is close.

We propose a hybrid deep learning framework to capture appearance, short-term motion, and long-term temporal dynamics in videos.

Page 7: Our Framework

We propose a hybrid deep learning framework to model rich multimodal information:

a) Appearance and short-term motion with CNNs
b) Long-term temporal information with LSTMs
c) Regularized fusion to explore feature correlations

[Figure: framework overview. Individual frames from the input video pass through spatial CNNs and an LSTM chain, producing predictions y_s^1, ..., y_s^Ts; stacked optical flow passes through motion CNNs and an LSTM chain, producing y_m^1, ..., y_m^Tm. A regularized fusion layer with weights W_s^E and W_m^E combines both streams into the final prediction.]
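The following is a minimal PyTorch-style sketch of one plausible reading of this diagram: per-frame spatial CNN features and per-chunk motion CNN features each go through an LSTM, and the two stream-level predictions are combined by a final fusion layer. The layer sizes, the last-time-step readout, and the plain linear fusion are assumptions; the actual fusion layer is regularized as described on the later slides.

import torch
import torch.nn as nn

class HybridVideoNet(nn.Module):
    """Sketch of the two-stream CNN-feature + LSTM pipeline in the diagram above.

    Assumptions (not from the paper): feature dimension 4096, hidden size 512,
    and a simple learned linear fusion of the per-stream class scores.
    """
    def __init__(self, num_classes, feat_dim=4096, hidden=512):
        super().__init__()
        self.spatial_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.motion_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.spatial_cls = nn.Linear(hidden, num_classes)
        self.motion_cls = nn.Linear(hidden, num_classes)
        # Fusion layer over the two stream predictions (regularized in the paper).
        self.fusion = nn.Linear(2 * num_classes, num_classes)

    def forward(self, spatial_feats, motion_feats):
        # spatial_feats, motion_feats: (batch, T, feat_dim) CNN features per frame /
        # per stacked-optical-flow chunk.
        hs, _ = self.spatial_lstm(spatial_feats)   # (batch, T, hidden)
        hm, _ = self.motion_lstm(motion_feats)
        ys = self.spatial_cls(hs[:, -1])           # prediction from the last time step
        ym = self.motion_cls(hm[:, -1])
        return self.fusion(torch.cat([ys, ym], dim=1))

# Example usage with random features for a 10-class problem.
net = HybridVideoNet(num_classes=10)
out = net(torch.randn(2, 25, 4096), torch.randn(2, 25, 4096))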

Page 8: Spatial and Motion CNN Features

[Figure: the input video provides individual frames to a spatial convolutional neural network and stacked optical flow to a motion convolutional neural network; the two streams are combined by score fusion.]
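A minimal sketch of how the stacked-optical-flow input for the motion CNN can be built. OpenCV's Farneback flow is used here only as a stand-in (the slide does not specify a flow algorithm), and the stack length L = 10 is an assumption borrowed from the usual two-stream setup.

import cv2
import numpy as np

def stacked_flow(gray_frames, L=10):
    """Stack horizontal and vertical flow of L consecutive frame pairs
    into a single (H, W, 2L) input for the motion CNN."""
    channels = []
    for prev, nxt in zip(gray_frames[:L], gray_frames[1:L + 1]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        channels.extend([flow[..., 0], flow[..., 1]])  # x- and y-displacement
    return np.stack(channels, axis=-1)

# Example: 11 random grayscale frames -> one 2L-channel motion input.
frames = [np.random.randint(0, 255, (224, 224), dtype=np.uint8) for _ in range(11)]
motion_input = stacked_flow(frames)   # shape (224, 224, 20)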

Page 9: Temporal Modeling with LSTM

[Figure: an unrolled recurrent neural network.]
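A minimal PyTorch sketch of this unrolled recurrence over per-frame CNN features, producing an output at every time step. The dimensions, the final averaging of per-step scores, and the use of nn.LSTM are illustrative assumptions.

import torch
import torch.nn as nn

# One CNN feature vector per sampled frame: (batch, T, feat_dim); sizes are illustrative.
frame_feats = torch.randn(4, 30, 4096)

lstm = nn.LSTM(input_size=4096, hidden_size=512, batch_first=True)
classifier = nn.Linear(512, 101)          # e.g. 101 classes as in UCF101

hidden_states, _ = lstm(frame_feats)      # (batch, T, 512): one hidden state per step
step_scores = classifier(hidden_states)   # O_1, ..., O_T: a prediction at every step
video_scores = step_scores.mean(dim=1)    # aggregate the per-step predictions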

Page 10: Regularized Feature Fusion

[Ngiam et al., ICML 2011; Srivastava et al., NIPS 2012]

Page 11: Regularized Feature Fusion

[Ngiam et al., ICML 2011; Srivastava et al., NIPS 2012]

DNN learning scheme:
- Calculate the prediction error
- Update the weights by back-propagation: w(t) -> w(t+1)

The fusion is performed in a free manner without explicitly exploring the feature correlations.

Page 12: Regularized Feature Fusion

Objective function: an empirical loss plus regularization terms that prevent overfitting, model feature relationships, and provide robustness.
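The objective function itself appears only as an image on the slide and is not in the transcript; based on the four annotations above, it takes roughly the following shape in LaTeX (this exact form is an assumption, not copied from the paper; the lambdas are trade-off weights and W^E denotes the fusion-layer weights):

\min_{W}\; \sum_{i} \ell\big(y_i, f(x_i; W)\big)
  \;+\; \lambda_1 \lVert W \rVert_F^2
  \;+\; \lambda_2 \lVert W^{E} \rVert_{2,1}
  \;+\; \lambda_3 \lVert W^{E} \rVert_{1}

where the four terms are, respectively, the empirical loss, the overfitting penalty, the feature-relationship (l21) term, and the robustness (l1) term.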


Page 14: Regularized Feature Fusion

Minimizing the l21 norm makes the weight matrix row-sparse; this is the term that models feature relationships.

Page 15: Regularized Feature Fusion

Minimizing the l1 norm prevents incorrect feature sharing; this is the term that provides robustness.
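A minimal NumPy sketch of the proximal operators associated with these two norms, which is what makes the sparsity patterns above concrete; the function names and thresholds are illustrative.

import numpy as np

def prox_l21(W, tau):
    """Row-wise soft-thresholding: rows with small L2 norm are zeroed entirely,
    which is what makes the matrix row-sparse."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return W * scale

def prox_l1(W, tau):
    """Element-wise soft-thresholding: individual small weights are zeroed,
    discouraging incorrect feature sharing within a surviving row."""
    return np.sign(W) * np.maximum(np.abs(W) - tau, 0.0)

W = np.random.randn(8, 5)
W = prox_l1(prox_l21(W, 0.5), 0.1)   # e.g. compose the two operators

Rows whose L2 norm falls below the threshold are zeroed as a whole, matching the row-sparsity described above, while the element-wise operator removes individual spurious weights.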

Page 16: Regularized Feature Fusion

Optimization: for the E-th (fusion) layer, the objective is minimized with proximal gradient descent.

Page 17: Regularized Feature Fusion

Algorithm:
1. Initialize the weights randomly
2. for epoch = 1 to K:
   ① Calculate the prediction error with a feed-forward pass
   for l = 1 to L:
     ② Back-propagate the prediction error and update the weight matrices
     ③ if l == E: evaluate the proximal operator
   end for
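A minimal PyTorch-style sketch of this learning scheme, in which each back-propagation update is followed by a proximal (soft-thresholding) step applied only to the fusion-layer weights. The network, data loader, thresholds, and the particular composition of the two operators are assumptions for illustration.

import torch

def proximal_step(W, tau_21, tau_1):
    """Apply the l21 (row-wise) and l1 (element-wise) proximal operators in place."""
    with torch.no_grad():
        row_norms = W.norm(dim=1, keepdim=True).clamp_min(1e-12)
        W.mul_((1.0 - tau_21 / row_norms).clamp_min(0.0))       # row sparsity
        W.copy_(W.sign() * (W.abs() - tau_1).clamp_min(0.0))    # element sparsity

def train(model, fusion_weight, loader, optimizer, loss_fn, epochs=10,
          tau_21=1e-3, tau_1=1e-4):
    for epoch in range(epochs):
        for x_spatial, x_motion, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x_spatial, x_motion), y)   # 1) prediction error
            loss.backward()                                  # 2) back-propagate
            optimizer.step()                                 #    and update weights
            proximal_step(fusion_weight, tau_21, tau_1)      # 3) prox on layer E only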

Page 18: Experiments

Datasets:
- UCF101: 101 action classes, 13,320 video clips from YouTube

- Columbia Consumer Videos (CCV): 20 classes, 9,317 videos from YouTube

Page 19: Experiments

Temporal modeling results:

Method                        UCF-101   CCV
Spatial ConvNet               80.4      75.0
Motion ConvNet                78.3      59.1
Spatial LSTM                  83.3      43.3
Motion LSTM                   76.6      54.7
ConvNet (spatial + motion)    86.2      75.8
LSTM (spatial + motion)       86.3      61.9
ConvNet + LSTM (spatial)      84.4      77.9
ConvNet + LSTM (motion)       81.4      70.9
All streams                   90.3      82.4

LSTMs are worse than CNNs on noisy, long videos. CNNs and LSTMs are highly complementary!

Page 20: Experiments

Regularized Feature Fusion:

Regularized fusion performs better than fusion in a free (unregularized) manner.

Method               UCF-101   CCV
Spatial SVM          78.6      74.4
Motion SVM           78.2      57.9
SVM-EF               86.6      75.3
SVM-LF               85.3      74.9
SVM-MKL              86.8      75.4
NN-EF                86.5      75.6
NN-LF                85.1      75.2
M-DBM                86.9      75.3
Two-Stream CNN       86.2      75.8
Regularized Fusion   88.4      76.2

Page 21: Experiments

Hybrid Deep Learning Framework:

Page 22: Experiments

Comparisons with the State of the Art:

CCV:
Xu et al.     60.3%
Ye et al.     64.0%
Jhuo et al.   64.0%
Ma et al.     63.4%
Liu et al.    68.2%
Wu et al.     70.6%
Ours          83.5%

UCF101:
Donahue et al.     82.9%
Srivastava et al.  84.3%
Wang et al.        85.9%
Tran et al.        86.7%
Simonyan et al.    88.0%
Lan et al.         89.1%
Zha et al.         89.6%
Ours               91.3%

Page 23: Conclusion

We propose a hybrid deep learning framework to model rich multimodal information:

1. Modeling appearance and short-term motion with CNNs

2. Capturing long-term temporal information with LSTM

3. Regularized fusion to explore feature correlations

Take-home messages:
1. LSTMs and CNNs are highly complementary.
2. Regularized feature fusion performs better.

Page 24: Thank you! Q & A

[email protected]