Page 1:

Tagging Video Contents with Positive/Negative Interest

Based on User’s Facial Expression

Masanori Miyahara, Masaki Aoki, Tetsuya Takiguchi, Yasuo Ariki

Graduate School of Engineering, Kobe University, Japan

The 14th International MultiMedia Modeling Conference (MMM2008)

9-11 January 2008, Kyoto University, Japan

Page 2:

Table of Contents

1. Motivation

2. Proposed System

3. Experiments

4. Conclusion and Future Work

Page 3:

Introduction

Multichannel digital broadcasting on TV

Video-sharing sites on the Internet

There are too many videos for viewers to select from

Video content recommendation system
  ⇒ Tagging video contents

Page 4:

Video contents recommendation system

(User analysis)
Remote control operation histories [1]
Favorite keywords [2]
Facial expression [3]

(Content analysis)
Motion vector
Color distribution
Object recognition

Tagged contents database

(Contents recommendation)
Collaborative Filtering [4]

[1] Taka, 2001  [2] Masumitsu, 2001  [3] Yamamoto, 2006  [4] Resnick, 1994

Page 5:

Main contribution

Conventional tagging system based on a viewer's facial expression [2006, Yamamoto]:

Estimates Interest or Neutral
  ⇒ Proposed: estimate Neutral, Positive or Negative

Simple facial feature points
  ⇒ Proposed: allocate more feature points and extract them by EBGM

Noisy frames (tilting the face, occlusion)
  ⇒ Proposed: reject noisy frames automatically

Page 6:

Experimental environment

(Figure: top view of the experimental environment — user, webcam, display, PC)

The viewer watches the display alone

The viewer's face is recorded on video by the webcam

A PC plays back the video content and also analyzes the viewer's face

Page 7:

Overview of proposed system

(System diagram)

AdaBoost: face region extraction
  ↓
EBGM: facial feature point extraction, person recognition
  ↓ (the personal model — neutral image and personal facial expression classifier — is retrieved for the recognized person)
SVM: facial expression recognition
  ↓
Tag: Neutral / Positive / Negative / Rejective

Page 8:

Face region extraction using AdaBoost

Extract face regions using AdaBoost based on Haar-like features [2001, Viola]

Merits:
- The face size is normalized
- Reduced computation time in the next processing step
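As a concrete illustration of this step (not from the slides), a minimal sketch using OpenCV's Haar-cascade (AdaBoost) face detector; the cascade file and the 128×128 output size are assumptions:

```python
# Minimal sketch of face-region extraction with OpenCV's Haar-cascade
# (AdaBoost) detector. The cascade file and the 128x128 output size are
# illustrative assumptions, not taken from the paper.
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_face_region(frame, size=(128, 128)):
    """Return a size-normalized face crop, or None if no face is found."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                                  # later tagged "Rejective"
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # keep the largest face
    return cv2.resize(gray[y:y + h, x:x + w], size)      # normalize face size
```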

Page 9:

Facial feature point extraction and person recognition using EBGM

A jet is a set of convolution coefficients obtained by applying Gabor kernels with different frequencies and orientations to a point in an image

A set of jets extracted at all facial feature points is called a face graph. A set of face graphs extracted from many people is called a bunch graph

The similarity between the bunch graph and a face graph is computed to search for facial feature points and to recognize the person

(Figure: Gabor wavelet, jet, bunch graph) [1997, Wiskott]
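A minimal sketch of computing a jet, under the common 5-frequency × 8-orientation EBGM setup [1997, Wiskott]; the kernel parameters and helper names are illustrative assumptions:

```python
# Minimal sketch of computing a "jet" at one facial feature point:
# magnitudes of Gabor-kernel responses at several frequencies and
# orientations. The 5x8 frequency/orientation grid follows the usual
# EBGM setup; the kernel parameters here are illustrative.
import cv2
import numpy as np

def compute_jet(gray, point, n_freq=5, n_orient=8, ksize=31):
    """Return the jet (vector of Gabor magnitudes) at pixel (x, y)."""
    x, y = point
    img = gray.astype(np.float32)
    jet = []
    for f in range(n_freq):
        lambd = 4.0 * (2.0 ** (f / 2.0))      # wavelength: coarse to fine
        for o in range(n_orient):
            theta = o * np.pi / n_orient      # kernel orientation
            kern = cv2.getGaborKernel((ksize, ksize), lambd, theta,
                                      lambd, 1.0)
            resp = cv2.filter2D(img, cv2.CV_32F, kern)
            jet.append(abs(float(resp[y, x])))
    return np.array(jet)

def jet_similarity(j1, j2):
    """Cosine similarity between two jets; summed over all feature
    points, this compares a face graph against a bunch graph."""
    return float(np.dot(j1, j2) / (np.linalg.norm(j1) * np.linalg.norm(j2)))
```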

Page 10:

Facial expression recognition using SVM

A viewer registers a personal model in advance: a neutral image and a personal facial expression SVM classifier

After EBGM recognizes a viewer, the system retrieves the personal model

Feature vector for SVM

(Figure: Neutral (A) vs. Positive (B) face images)
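A minimal sketch of such a personal classifier with scikit-learn's SVC; using feature-point displacements from the registered neutral image as the feature vector is an assumption made for illustration:

```python
# Minimal sketch of the personal facial-expression classifier with
# scikit-learn's SVM. The feature vector -- displacements of facial
# feature points relative to the registered neutral image -- is an
# illustrative assumption.
import numpy as np
from sklearn.svm import SVC

def make_feature(points, neutral_points):
    """Displacements of facial feature points from the neutral face."""
    return (np.asarray(points, float) - np.asarray(neutral_points, float)).ravel()

class PersonalExpressionClassifier:
    def __init__(self, neutral_points):
        self.neutral = neutral_points
        self.svm = SVC(kernel="rbf")   # multi-class handled one-vs-one

    def train(self, frames_points, labels):
        # labels: manual tags "Neutral" / "Positive" / "Negative"
        X = [make_feature(p, self.neutral) for p in frames_points]
        self.svm.fit(X, labels)

    def predict(self, points):
        return self.svm.predict([make_feature(points, self.neutral)])[0]
```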

Page 11:

Definition of Facial Expression Classes

Classes          Meanings
Neutral (Neu)    Expressionless
Positive (Pos)   Happiness, Laughter, Pleasure, etc.
Negative (Neg)   Anger, Disgust, Displeasure, etc.
Rejective (Rej)  Not watching the display in the front direction, occluding part of the face, tilting the face, etc.

Page 12:

Example of “Neutral”

Page 13:

Example of “Positive”

Page 14:

Example of “Negative”

Page 15:

Example of “Rejective”

Page 16:

Automatic rejection

If no face region is extracted from a frame, the frame is tagged as "Rejective"

If a face region is extracted, the frame is used as training and testing data for the SVM

(Flowchart: Face region extracted? → Yes → SVM → Neutral / Positive / Negative; No → Rejective)
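Putting the stages together, a minimal sketch of this rejection flow, reusing the hypothetical helpers sketched on the previous pages; extract_feature_points stands in for the EBGM step and is assumed:

```python
# Minimal sketch of the automatic-rejection flow. extract_face_region
# and PersonalExpressionClassifier are the hypothetical helpers sketched
# earlier; extract_feature_points is an assumed stand-in for the EBGM
# feature-point search.
def tag_frame(frame, classifier):
    face = extract_face_region(frame)
    if face is None:
        return "Rejective"                 # no face region: noisy frame
    points = extract_feature_points(face)  # EBGM feature-point search
    return classifier.predict(points)      # Neutral / Positive / Negative
```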

Page 17:

Experimental conditions

Two subjects (A and B) watched four videos; their average length was 17 minutes

The category was "variety shows"

The system recorded their facial video in synchronization with the video content at 15 fps

Afterwards, the subjects tagged the videos with the four labels

Page 18:

Tagged results (manual)

           Neu    Pos   Neg   Rej   Total (frames)
Subject A  49865  7665  3719  1466  62715
Subject B  56531  2347  3105  775   62758

We used these experimental videos and the tagging data as training and testing data in the SVM experiments that follow

Page 19:

Preliminary experiment 1

(System overview diagram, repeated from Page 7)

Experiment on the performance of face region extraction using AdaBoost

Page 20:

Preliminary experiment 1: Face region extraction using AdaBoost

Subject A         Neu    Pos    Neg
False extraction  20     3      1
Total frames      49865  7665   3719
Rate (%)          0.040  0.039  0.027

Subject B         Neu    Pos    Neg
False extraction  132    106    9
Total frames      56531  2347   3105
Rate (%)          0.234  4.516  0.290

Page 21:

Preliminary experiment 2

(System overview diagram, repeated from Page 7)

Experiment on the performance of person recognition using EBGM

Page 22:

Preliminary experiment 2: Person recognition using EBGM

Subject A          Neu    Pos    Neg
False recognition  2      0      0
Total frames       49845  7662   3718
Rate (%)           0.004  0.000  0.000

Subject B          Neu    Pos    Neg
False recognition  2      20     0
Total frames       56399  2241   3096
Rate (%)           0.004  0.893  0.000

Page 23:

Experiment

(System overview diagram, repeated from Page 7)

Experiment on the performance of facial expression recognition using SVM

Page 24:

Experiment: Facial expression recognition using SVM

Three of the four experimental videos (11,000 sec × 15 fps × 3 videos ≒ 500,000 frames) were used as training data, and the remaining video as testing data (cross-validation)

Precision and recall rates were computed as follows
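The formulas themselves were not captured in this transcript; the standard per-class definitions, consistent with the confusion matrices on Pages 30 and 31, are:

$$\mathrm{precision}(c) = \frac{\#\,\{\text{frames the system tagged } c \text{ that were manually labeled } c\}}{\#\,\{\text{frames the system tagged } c\}}$$

$$\mathrm{recall}(c) = \frac{\#\,\{\text{frames the system tagged } c \text{ that were manually labeled } c\}}{\#\,\{\text{frames manually labeled } c\}}$$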

Page 25:

Experimental results: Facial expression recognition using SVM

(Bar chart, reconstructed as a table; the values match the per-class averages over subjects A and B from the confusion matrices on Pages 30 and 31)

           Neu   Pos   Neg   Rej
Precision  0.98  0.93  0.88  0.74
Recall     0.98  0.89  0.82  0.81

Page 26:

Discussion

The averaged recall rate was 87% and the averaged precision rate was 88%.

When the subjects expressed their emotions only modestly, the system often mistook the facial expression for Neutral.

When the subjects showed an intermediate facial expression, the system often made mistakes, because only one expression class was assumed per frame.

Page 27:

Demo

Page 28:

Conclusion and future work

We proposed a system that tags video contents with Neutral, Positive, Negative or Rejective labels based on the viewer's facial expression.

Future work:
- Evaluation with various categories and various subjects
- Combination of user analysis and content analysis
- Construction of an automatic video content recommendation system

Page 29:

Page 30:

Experimental result: Confusion matrix - Subject A

Subject A      Neu    Pos    Neg    Rej    Sum    Recall (%)
Neu            48275  443    525    622    49865  96.81
Pos            743    6907   1      14     7665   90.11
Neg            356    107    3250   6      3719   87.39
Rej            135    0      5      1326   1466   90.45
Sum            49509  7457   3781   1968   62715  91.19
Precision (%)  97.51  92.62  85.96  67.38  85.87

Page 31:

Experimental result: Confusion matrix - Subject B

Subject B      Neu    Pos    Neg    Rej    Sum    Recall (%)
Neu            56068  138    264    61     56531  99.18
Pos            231    2076   8      32     2347   88.45
Neg            641    24     2402   38     3105   77.36
Rej            203    0      21     551    775    71.10
Sum            57143  2238   2695   682    62758  84.02
Precision (%)  98.12  92.76  89.13  80.79  90.20

Page 32:

Computation time

Recording facial video: 15 fps

Processing facial video: Pentium 4, 3 GHz → 1 fps; Core 2 Quad, 2.4 GHz → 4 fps

If a user watches video content for no more than about 6 hours per day, a Core 2 Quad system can complete the processing
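As a rough check of this claim (not on the slide): watching $T$ hours at 15 fps produces $15 \times 3600 \times T$ frames, while processing during the remaining $24 - T$ hours at 4 fps handles $4 \times 3600 \times (24 - T)$ frames. The break-even point is

$$15T = 4(24 - T) \;\Rightarrow\; T = \tfrac{96}{19} \approx 5.1\ \text{hours},$$

which is broadly consistent with the slide's estimate of about 6 hours.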

Page 33:

Conventional system vs. proposed system

The video data used to evaluate the conventional system has not been released, so a direct comparison is difficult

Page 34:

Learning in advance

The current proposed system requires registering in advance:
- The viewer's neutral image
- His personal facial expression classifier (the viewer watches video content for 50 minutes and then tags it)

Future work (saving the user's trouble): semi-supervised or unsupervised learning

Page 35:

The video categories

Experiment in this study: only "variety shows"

Future work: "dramas", "news", etc.

However, we expect our system not to work well on such categories, because a viewer's facial expression changes little when watching such videos

Page 36:

Conventional facial expression recognition

Many studies on facial expression recognition have been carried out, so we also have to consider other methods

Page 37:

Evaluation

The values reported in the experimental results:
(averaged recall for subject A + averaged recall for subject B) / 2
(averaged precision for subject A + averaged precision for subject B) / 2
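Plugging in the confusion-matrix averages from Pages 30 and 31: recall $(91.19 + 84.02)/2 \approx 87.6\%$ and precision $(85.87 + 90.20)/2 \approx 88.0\%$, matching the 87% and 88% figures quoted in the Discussion.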

Page 38:

Q and A (audience questions, translated from Japanese)

- About real-time capability
- What is the point of adding more feature points?
- About the learning in advance
- Is "variety shows" alone enough?
- How will content recommendation actually be done?
- Why EBGM? What about AAM?
- What about horror movies?
- What about Ekman's six basic expressions?
- When is the tagging done?
- TVs do not come with cameras
- Wouldn't simply pressing a button suffice?
- Processing time could be shortened by skipping frames in which the face does not move
- Which is better, per-frame or per-interval tagging?
- Aren't keywords or EPG data alone sufficient?
- Having one's face filmed meets psychological resistance
- Where does "being moved by a drama" fit in?
- Manual tagging was done by the subjects themselves; what if someone else did it?
- Is there an effect from watching a video twice?
- How does the Rejective mechanism work?
- How does it compare with other methods in the facial expression field?
- Automatic program recommendation would require a large number of subjects; is that feasible?
- Are two subjects enough?
- How do you guarantee that subjects neither exaggerated nor suppressed their expressions?
- When tagging, do subjects watch the content or their own face?
- When the facial expression cannot be recognized, how will programs be recommended? Is there a plan?
- What is the technical novelty? Isn't it just a combination of existing techniques?
- Wouldn't registering a neutral image and facial expressions in advance be a barrier for a large number of users?
- What about sports, movies, etc.?
- In a real environment, how do you cope with lighting changes and face movements?
- What about TRECVID?
- The main contribution is not stated in the conclusion
- What is collaborative filtering?