MediaEval 2015 - RFA at MediaEval 2015 Affective Impact of Movies Task: A Multimodal Approach

MediaEval 2015

RFA at MediaEval 2015 Affective Impact of Movies Task: A Multimodal Approach

Ionuț Mironică1

[email protected]

Bogdan Ionescu1

[email protected]

Mats Sjöberg2

[email protected]

Markus Schedl3

[email protected]

Marcin Skowron4

[email protected]

Romania Finland Austria The 2015 Affective Impact of Movies Task (includes Violent Scenes Detection)

2 Helsinki Institute for Information Technology HIIT University of Helsinki,

University POLITEHNICA of Bucharest

1 3 Austrian Research Institute for Artificial Intelligence, Vienna, Austria

4

MediaEval 2015 2

Presentation outline

•  Global approach

•  Video content description

•  Experimental results

•  Conclusions

MediaEval 2015 3

> challenge: find a way to assign violence estimation tags to unknown videos;

> approach: machine learning paradigm;

Global Approach

labeled data

unlabeled data

train

extract features

estimate new videos

MediaEval 2015

objective 2: test a broad range of frame aggregation techniques

•  We focus on:

Global Approach

objective 3: test several fusion techniques

objective 1: go multimodal

visual audio motion

4

MediaEval 2015

Global Approach

Extract features Frame aggregation Global video features Train classifier

•  Bag of Words •  Fisher kernel •  Vector of Locally Aggregated Descriptors

5

MediaEval 2015

Video Content Description

[K. Seyerlehner et al., SMC, 2010]

Audio features

f1 fn … f2

time

•  Block-based audio features

Motion features

[J. Uijlings et al., IJMIR, 2014]

•  3D-HoG / 3D-HOF features

6

MediaEval 2015

Video Content Description

Visual features

[K. E. A. van de Sande et al., TPAMI, 2010]

•  ColorSIFT features

•  CENsus Transform hISTogram (CENTRIST)

[J. Wu et al., TPAMI, 2011]

•  CNN features

[J. Krizhevsky et al., NIPS, 2011]

7

MediaEval 2015 8

Evaluation

(1) Performance on Violence Detection Task

- the best performance is used with Fisher kernel and CNN visual features

- fusing all the features together did not improve the results above the FK-CNN only result

Description MAP Run 1 Average on audio descriptors & nonlinear SVM 0.0485

Run 2 Average on visual features & nonlinear SVM 0.0452

Run 3 Modified VLAD with motion features & linear SVM 0.0768

Run 4 Fisher kernel with CNN visual features 0.1419

Run 5 Late fusion between all the previous runs 0.0824

MediaEval 2015 9

Evaluation

(2) Performance on Emotional Impact of Movies Task

Description Accuracy valence

Accuracy arousal

Run 1 Average on audio descriptors & nonlinear SVM 33.032% 45.038%

Run 2 Average on visual features & nonlinear SVM 36.123% 34.104%

Run 3 Modified VLAD with motion features & linear SVM 29.731% 39.865%

Run 4 Fisher kernel with CNN visual features 30.320% 44.365%

Run 5 Late fusion between all the previous runs 29.752% 37.595%

MediaEval 2015 10

Conclusions •  we obtained the best results on the violence task by using motion and visual

features;

•  the visual / motion features obtained lower results for both valence and arousal predictions.

•  on the other side, we obtained the best results on the a affect task using the audio features only;

MediaEval 2015 11

Thank you!

Questions?

MediaEval 2015 - RFA at MediaEval 2015 Affective Impact of Movies Task: A Multimodal Approach

Education

Transcript of MediaEval 2015 - RFA at MediaEval 2015 Affective Impact of Movies Task: A Multimodal Approach