Fudan-NJUST at MediaEval 2014: Violent Scenes Detection Using Deep Neural Networks
Qi Dai*, Zuxuan Wu*, Yu-Gang Jiang*, Xiangyang Xue*, Jinhui Tang#
*School of Computer Science, Fudan University, Shanghai, China
#School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China
MediaEval 2014 Workshop, Oct 16-17, Barcelona, Spain
Problem
• Detecting violent scenes in both movies and short web videos
[Example frames: a violent scene in a Hollywood movie and a violent scene in a short web video]
System Overview
• Several features were used, including trajectory-based features and two other visual/audio features
• In addition to SVM, we adopted deep neural networks (DNN) as a feature fusion and classification method
[System diagram: Video Clips → Feature Extraction (FV-HOG, FV-HOF, FV-MBH, FV-TrajShape, TrajMF-HOG, TrajMF-HOF, TrajMF-MBH, STIP, MFCC) → SVM and DNN classifiers → Fusion → Smoothing & Merging]
Features
• Trajectory-based features:
  – Improved Dense Trajectories (HOG, HOF, MBH, Trajectory Shape), encoded with Fisher Vectors (FV)
  – Dimension-reduced TrajMF (relative locations and motions between trajectory pairs), implemented based on:
• Two additional visual/audio features, as a complement to the trajectory-based features:
  – Spatio-temporal interest points (STIP)
  – Audio MFCC
Y.-G. Jiang, Q. Dai, X. Xue, W. Liu, and C.-W. Ngo. Trajectory-based modeling of human actions with motion reference points. In ECCV, 2012.
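As an illustration of the FV encoding step, the sketch below computes a simplified improved Fisher Vector (first- and second-order GMM gradients with power and L2 normalisation) over one clip's local descriptors. This is a NumPy sketch under assumptions, not the team's actual implementation; the diagonal GMM is assumed to be fitted offline on a training pool of descriptors.

```python
import numpy as np

def fisher_vector(descriptors, weights, means, variances):
    """Simplified improved Fisher Vector for one clip's local descriptors.

    descriptors: (N, D) local features (e.g. HOG/HOF/MBH along trajectories).
    weights (K,), means (K, D), variances (K, D): a diagonal GMM assumed
    to be fitted offline.  Returns a 2*K*D vector of first- and
    second-order statistics.
    """
    N, _ = descriptors.shape
    std = np.sqrt(variances)                               # (K, D)

    # Soft-assign each descriptor to each Gaussian (posterior probabilities).
    diff = descriptors[:, None, :] - means[None]           # (N, K, D)
    log_p = -0.5 * np.sum(diff ** 2 / variances[None]
                          + np.log(2 * np.pi * variances[None]), axis=2)
    log_p += np.log(weights)[None]
    log_p -= log_p.max(axis=1, keepdims=True)              # numerical stability
    post = np.exp(log_p)
    post /= post.sum(axis=1, keepdims=True)                # (N, K)

    z = diff / std[None]                                   # whitened residuals
    # Gradients w.r.t. the GMM means and standard deviations.
    g_mu = (post[:, :, None] * z).sum(0) / (N * np.sqrt(weights))[:, None]
    g_sig = (post[:, :, None] * (z ** 2 - 1)).sum(0) / (N * np.sqrt(2 * weights))[:, None]

    fv = np.concatenate([g_mu.ravel(), g_sig.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                 # power normalisation
    return fv / (np.linalg.norm(fv) + 1e-12)               # L2 normalisation
```

In a pipeline like the one above, each descriptor type would get its own GMM and FV, and the resulting high-dimensional vectors suit the linear kernels listed on the Classifiers slide.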
Classifiers
• SVM
  – Chi-square kernel for STIP and MFCC; linear kernel for the other features
  – Kernel fusion within the trajectory-based features; score-level late fusion to combine them with STIP and MFCC
• Regularized DNN (ACM Multimedia 2014 full paper)
  – Multiple hand-crafted features are used as inputs
  – Fuses features in a more rigorous fashion by considering both feature correlation and feature diversity
Z. Wu, Y.-G. Jiang, J. Wang, J. Pu, and X. Xue. Exploring inter-feature and inter-class relationships with deep neural networks for video classification. In ACM Multimedia, Orlando, USA, Nov. 2014. (Full paper)
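For illustration, the chi-square kernel and the score-level late fusion could be sketched as follows. This is a NumPy-only sketch; `gamma` and the fusion weights are placeholder hyper-parameters, not the values used in the submission.

```python
import numpy as np

def chi_square_kernel(X, Y, gamma=1.0):
    """Exponential chi-square kernel for non-negative histogram features
    (here: bag-of-words representations such as STIP and MFCC)."""
    K = np.zeros((X.shape[0], Y.shape[0]))
    for i, x in enumerate(X):
        # Chi-square distance between x and every row of Y.
        dist = 0.5 * np.sum((x[None] - Y) ** 2 / (x[None] + Y + 1e-12), axis=1)
        K[i] = np.exp(-gamma * dist)
    return K

def late_fusion(score_lists, weights=None):
    """Score-level late fusion: weighted average of per-classifier scores."""
    S = np.asarray(score_lists, dtype=float)            # (n_classifiers, n_samples)
    if weights is None:
        weights = np.full(S.shape[0], 1.0 / S.shape[0]) # uniform by default
    return weights @ S
```

The precomputed kernel matrix would be passed to an SVM trainer that accepts custom kernels; late fusion then averages the per-feature decision scores on the test clips.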
[DNN architecture diagram: per-feature abstraction layers → feature fusion layer → classification layer]
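A minimal forward pass matching this abstraction → fusion → classification layout might look like the sketch below. It illustrates only the architecture, not the regularized training objective of the cited paper; all layer sizes and feature dimensions are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_layer(n_in, n_out):
    """Random dense layer (weights, bias); sizes here are arbitrary."""
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

def forward(features, abstraction, fusion, classifier):
    # 1) Feature abstraction: one hidden layer per input feature type.
    hidden = [relu(f @ W + b) for f, (W, b) in zip(features, abstraction)]
    # 2) Feature fusion: a joint layer over the concatenated abstractions.
    W, b = fusion
    fused = relu(np.concatenate(hidden, axis=1) @ W + b)
    # 3) Classification: a violence score per video clip.
    W, b = classifier
    return sigmoid(fused @ W + b)

# Two hypothetical feature types for 5 clips (dimensions are made up).
f_traj, f_audio = rng.normal(size=(5, 16)), rng.normal(size=(5, 10))
layers = [init_layer(16, 8), init_layer(10, 8)]
scores = forward([f_traj, f_audio], layers, init_layer(16, 8), init_layer(8, 1))
```

The cited paper's contribution is in how the fusion layer is regularized during training (encouraging feature correlation and diversity); this sketch only shows where that layer sits in the network.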
Results (MAP2014)
• Run 1: SVM + score merging
• Run 2: DNN + merging
• Run 3: SVM + DNN + merging
• Run 4: SVM + DNN + smoothing + merging
• Run 5: SVM + DNN
• Note: for DNN, we used fewer features (excluding the FV encodings of HOG, HOF, and MBH)
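The smoothing and merging post-processing used in these runs could be sketched as follows: a generic centred moving average over per-segment scores, plus gap-based merging of positive segments into scenes. The window and gap values are illustrative assumptions, not the submission's settings.

```python
def smooth_scores(scores, window=3):
    """Temporal smoothing: centred moving average over segment scores."""
    half = window // 2
    out = []
    for i in range(len(scores)):
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out

def merge_segments(segments, max_gap=1.0):
    """Merge temporally close positive segments (start, end) into scenes."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)   # extend previous scene
        else:
            merged.append([start, end])               # start a new scene
    return [tuple(s) for s in merged]
```

Smoothing suppresses isolated score spikes before thresholding, and merging turns runs of adjacent positive segments into contiguous violent scenes; as the Observations note, the best order of the two steps is an open question.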
[Bar chart: MAP2014 of Runs 1–5 on the Main Task and the Generalization Task; scores range from 0.40 to 0.63]
Observations
• DNN is significantly better than SVM, even though some features were not used in the DNN.
• Directly fusing SVM and DNN incurs a small performance drop; better results may be obtained after parameter optimization (needs more investigation).
• Smoothing and merging are useful, but the correct order of applying the two and their individual contributions need more experiments to be fully understood.
• Some conclusions drawn from the main task do not generalize to the generalization task; this also requires further investigation for a more concrete understanding of the problem.