MediaEval 2015 - RFA at MediaEval 2015 Affective Impact of Movies Task: A Multimodal Approach
-
Upload
multimediaeval -
Category
Education
-
view
82 -
download
1
Transcript of MediaEval 2015 - RFA at MediaEval 2015 Affective Impact of Movies Task: A Multimodal Approach
MediaEval 2015
RFA at MediaEval 2015 Affective Impact of Movies Task: A Multimodal Approach
Ionuț Mironică1
Bogdan Ionescu1
Mats Sjöberg2
Markus Schedl3
Marcin Skowron4
Romania Finland Austria The 2015 Affective Impact of Movies Task (includes Violent Scenes Detection)
2 Helsinki Institute for Information Technology HIIT University of Helsinki,
University POLITEHNICA of Bucharest
1 3 Austrian Research Institute for Artificial Intelligence, Vienna, Austria
4
MediaEval 2015 2
Presentation outline
• Global approach
• Video content description
• Experimental results
• Conclusions
MediaEval 2015 3
> challenge: find a way to assign violence estimation tags to unknown videos;
> approach: machine learning paradigm;
Global Approach
labeled data
unlabeled data
train
extract features
estimate new videos
MediaEval 2015
objective 2: test a broad range of frame aggregation techniques
• We focus on:
Global Approach
objective 3: test several fusion techniques
objective 1: go multimodal
visual audio motion
4
MediaEval 2015
Global Approach
Extract features Frame aggregation Global video features Train classifier
• Bag of Words • Fisher kernel • Vector of Locally Aggregated Descriptors
5
MediaEval 2015
Video Content Description
[K. Seyerlehner et al., SMC, 2010]
Audio features
f1 fn … f2
time
• Block-based audio features
Motion features
[J. Uijlings et al., IJMIR, 2014]
• 3D-HoG / 3D-HOF features
6
MediaEval 2015
Video Content Description
Visual features
[K. E. A. van de Sande et al., TPAMI, 2010]
• ColorSIFT features
• CENsus Transform hISTogram (CENTRIST)
[J. Wu et al., TPAMI, 2011]
• CNN features
[J. Krizhevsky et al., NIPS, 2011]
7
MediaEval 2015 8
Evaluation
(1) Performance on Violence Detection Task
- the best performance is used with Fisher kernel and CNN visual features
- fusing all the features together did not improve the results above the FK-CNN only result
Description MAP Run 1 Average on audio descriptors & nonlinear SVM 0.0485
Run 2 Average on visual features & nonlinear SVM 0.0452
Run 3 Modified VLAD with motion features & linear SVM 0.0768
Run 4 Fisher kernel with CNN visual features 0.1419
Run 5 Late fusion between all the previous runs 0.0824
MediaEval 2015 9
Evaluation
(2) Performance on Emotional Impact of Movies Task
Description Accuracy valence
Accuracy arousal
Run 1 Average on audio descriptors & nonlinear SVM 33.032% 45.038%
Run 2 Average on visual features & nonlinear SVM 36.123% 34.104%
Run 3 Modified VLAD with motion features & linear SVM 29.731% 39.865%
Run 4 Fisher kernel with CNN visual features 30.320% 44.365%
Run 5 Late fusion between all the previous runs 29.752% 37.595%
MediaEval 2015 10
Conclusions • we obtained the best results on the violence task by using motion and visual
features;
• the visual / motion features obtained lower results for both valence and arousal predictions.
• on the other side, we obtained the best results on the a affect task using the audio features only;