Visual Event Recognition in Videos by Learning from Web Data Lixin Duan, Dong Xu, Ivor Tsang, Jiebo...

Visual Event Recognition in Videos by Learning from Web Data Lixin Duan , Dong Xu , Ivor Tsang , Jiebo Luo Nanyang Technological University, Singapore Kodak Research Labs, Rochester, NY, USA

Visual Event Recognition in Videos by Learning from Web Data

Lixin Duan†, Dong Xu†, Ivor Tsang†, Jiebo Luo¶

† Nanyang Technological University, Singapore
¶ Kodak Research Labs, Rochester, NY, USA

Outline

• Overview of the Event Recognition System
• Similarity between Videos
– Aligned Space-Time Pyramid Matching
• Cross-Domain Problem
– Adaptive Multiple Kernel Learning
• Experiments
• Conclusion

Overview

• GOAL: Recognize consumer videos

• Large intra-class variability; limited labeled videos

[Figure: example videos for the events Sports, Picnic, and Wedding]

Overview

• GOAL: Recognize consumer videos by leveraging a large number of loosely labeled web videos (e.g., from YouTube)

[Figure: consumer videos and a large number of web videos for the events Sports, Picnic, and Wedding]

Overview

• Flowchart of the system

[Figure: flowchart with the blocks Video Database, Test video, Classifier, and Output]

Similarity between Videos

• Pyramid matching methods
– Temporally aligned pyramid matching, D. Xu and S.-F. Chang [1]
– Unaligned space-time pyramid matching, I. Laptev [2]

[Figure: pyramid partitions along the time axis, the space axes, and the space-time axes]

Similarity between Videos

• Aligned Space-Time Pyramid Matching
– Each video is divided into non-overlapped space-time volumes at each pyramid level
– Greater variability

• Two-step approach
– Distances between space-time volumes: solved by existing methods such as the bag-of-words model, I. Laptev [2]
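The partition step can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes that at pyramid level l each of the width, height, and time axes is cut into 2^l equal parts (so 8^l volumes), and that every local feature has already been quantised to a visual word. The function names and the `(x, y, t, word_id)` tuple format are hypothetical.

```python
from collections import Counter

def volume_index(x, y, t, width, height, frames, level):
    """Map a feature location to the index of its space-time volume.

    Assumption: at pyramid level `level`, each of the three axes is cut into
    2**level equal parts, giving 8**level non-overlapping volumes.
    """
    parts = 2 ** level
    ix = min(int(x * parts / width), parts - 1)
    iy = min(int(y * parts / height), parts - 1)
    it = min(int(t * parts / frames), parts - 1)
    return (it * parts + iy) * parts + ix

def volume_histograms(features, width, height, frames, level, vocab_size):
    """Build one L1-normalised bag-of-words histogram per space-time volume.

    `features` is a list of (x, y, t, word_id) tuples (hypothetical format).
    Empty volumes get an all-zero histogram.
    """
    counts = [Counter() for _ in range(8 ** level)]
    for x, y, t, word in features:
        counts[volume_index(x, y, t, width, height, frames, level)][word] += 1
    hists = []
    for c in counts:
        total = sum(c.values())
        hists.append([c[w] / total if total else 0.0 for w in range(vocab_size)])
    return hists

# Four toy features in a 320x240 video with 100 frames, vocabulary of 3 words.
feats = [(10, 10, 5, 0), (300, 10, 5, 1), (300, 200, 90, 1), (310, 220, 95, 2)]
hists = volume_histograms(feats, width=320, height=240, frames=100,
                          level=1, vocab_size=3)
```

Any histogram distance between a volume of one video and a volume of another can then serve as the volume-to-volume distance used in the second step.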

Similarity between Videos

• Aligned Space-Time Pyramid Matching
– Level 1

[Figure: distances between the space-time volumes of videos V_i and V_j]

Similarity between Videos

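[Figure: matching the space-time volumes of V_i to those of V_j by their pairwise distances]

• Integer-flow Earth Mover’s Distance (EMD), Y. Rubner [3]

$$\hat{F}_{rc} = \arg\min_{F_{rc} \in \{0,1\}} \sum_{r=1}^{R} \sum_{c=1}^{R} F_{rc} D_{rc} \quad \text{s.t.} \quad \sum_{c=1}^{R} F_{rc} = 1, \ \forall r; \quad \sum_{r=1}^{R} F_{rc} = 1, \ \forall c.$$

$$D(V_i, V_j) = \frac{\sum_{r=1}^{R} \sum_{c=1}^{R} \hat{F}_{rc} D_{rc}}{\sum_{r=1}^{R} \sum_{c=1}^{R} \hat{F}_{rc}}$$

where $D_{rc}$ is the distance between the r-th volume of $V_i$ and the c-th volume of $V_j$, and $R$ is the number of volumes per video.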
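Because each flow variable is 0 or 1 and every row and column sums to 1, the optimal flow is a one-to-one matching between the volumes of the two videos. The sketch below finds that matching by brute force over permutations, which is only practical for small R. It illustrates the optimisation problem and is not the solver used by the authors; `aligned_distance` is a hypothetical name.

```python
from itertools import permutations

def aligned_distance(D):
    """Integer-flow EMD between two videos with R volumes each.

    D[r][c] is the distance between volume r of V_i and volume c of V_j.
    A feasible 0/1 flow is a permutation: volume r is matched to perm[r].
    """
    R = len(D)
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(R)):
        cost = sum(D[r][perm[r]] for r in range(R))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    # The denominator sum_r sum_c F_rc equals R for a one-to-one matching.
    return best_cost / R, list(best_perm)

# Toy 3x3 distance matrix between the volumes of two videos.
D = [[4.0, 1.0, 3.0],
     [2.0, 0.0, 5.0],
     [3.0, 2.0, 2.0]]
distance, matching = aligned_distance(D)
```

For larger R, a polynomial-time assignment solver (for example the Hungarian algorithm) would be the natural replacement for the permutation loop.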

Cross-Domain Problem

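• Data distribution mismatch between consumer videos and web videos
– Consumer videos: Naturally captured
– Web videos: Edited; Selected

• Maximum Mean Discrepancy (MMD), K. M. Borgwardt [4]

$$\mathrm{DIST}_k(D^A, D^T) = \left\| \frac{1}{n_A} \sum_{i=1}^{n_A} \varphi(x_i^A) - \frac{1}{n_T} \sum_{i=1}^{n_T} \varphi(x_i^T) \right\|_{\mathcal{H}}$$

$$\Rightarrow \ \mathrm{DIST}_k^2(D^A, D^T) = \mathrm{tr}(KS)$$

where $K$ is the kernel matrix over the $n_A + n_T$ samples of both domains, $S = \mathbf{s}\mathbf{s}'$, and $\mathbf{s} = [1/n_A, \ldots, 1/n_A, -1/n_T, \ldots, -1/n_T]'$.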
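The trace form can be checked numerically. Expanding the squared norm gives s'Ks with s holding 1/n_A for the first domain and -1/n_T for the second, and s'Ks = tr(K ss'). The sketch below uses a linear kernel so the result can be compared with the squared distance between the two sample means. It assumes that A denotes the auxiliary (web) domain and T the target (consumer) domain; the function names and toy data are hypothetical.

```python
def mmd_squared(K, n_a, n_t):
    """Squared MMD as tr(KS) with S = s s', computed as s' K s.

    K is the (n_a + n_t) x (n_a + n_t) kernel matrix, with the n_a samples
    of the first domain ordered before the n_t samples of the second.
    """
    s = [1.0 / n_a] * n_a + [-1.0 / n_t] * n_t
    n = n_a + n_t
    return sum(s[i] * K[i][j] * s[j] for i in range(n) for j in range(n))

def linear_kernel(X):
    """Gram matrix of plain inner products (phi is the identity map)."""
    return [[sum(a * b for a, b in zip(x, z)) for z in X] for x in X]

aux = [[0.0, 0.0], [2.0, 0.0]]   # sample mean (1, 0)
tgt = [[1.0, 3.0], [1.0, 5.0]]   # sample mean (1, 4)
K = linear_kernel(aux + tgt)
value = mmd_squared(K, n_a=len(aux), n_t=len(tgt))
```

With the linear kernel the two means differ by (0, -4), so the squared MMD is the squared length of that vector.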

Cross-Domain Problem

• Suppose there are $P$ pre-learned classifiers $\{f_p(x)\}_{p=1}^{P}$
• $f_p(x)$ is learned by SVM with the labeled training data from both domains
• Proposed target decision function

$$f^T(x) = \underbrace{\sum_{p=1}^{P} \beta_p f_p(x)}_{\text{prior information}} + \Delta f(x)$$

where $\beta_p$ is the linear combination coefficient and $\Delta f(x)$ is the perturbation function.
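As a minimal sketch of how the target decision function is evaluated, the code below combines pre-learned classifiers with a perturbation function. The classifiers `f1`, `f2` and `delta_f` are hypothetical stand-ins for models learned elsewhere, and the coefficient values are arbitrary.

```python
def target_decision(x, prelearned, beta, perturbation):
    """f_T(x) = sum_p beta_p * f_p(x) + delta_f(x)."""
    prior = sum(b * f(x) for b, f in zip(beta, prelearned))
    return prior + perturbation(x)

# Hypothetical stand-ins for classifiers learned elsewhere.
def f1(x):
    return x[0] - 1.0

def f2(x):
    return 0.5 * x[1]

def delta_f(x):
    return 0.1 * (x[0] + x[1])

score = target_decision([2.0, 4.0], [f1, f2], beta=[0.5, 0.25],
                        perturbation=delta_f)
```

The sign of the score would give the predicted label in a binary setting.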

Cross-Domain Problem

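• Motivated by Multiple Kernel Learning (MKL) (F. Bach [5]), perturbation function

$$\Delta f(x) = \sum_{m=1}^{M} d_m \mathbf{w}_m' \varphi_m(\mathbf{x}) + b$$

where $d_m \ge 0$ and $\sum_{m=1}^{M} d_m = 1$.

• MKL: $K = \sum_{m=1}^{M} d_m K_m$, where $K_m$ is the kernel matrix of the m-th base kernel
• MMD

$$\Omega(\mathbf{d}) := \mathrm{DIST}_k^2(D^A, D^T) = \mathrm{tr}(KS) = \mathbf{h}'\mathbf{d}$$

where $\mathbf{h} = [\mathrm{tr}(K_1 S), \ldots, \mathrm{tr}(K_M S)]'$.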
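The identity tr(KS) = h'd follows from the linearity of the trace once the kernel matrix is a weighted sum of base kernel matrices, K = sum_m d_m K_m, with h_m = tr(K_m S). The sketch below checks this on toy matrices. It assumes S = ss' with s as in the MMD definition; all names and values are hypothetical.

```python
def trace_KS(K, s):
    """tr(K s s') computed as s' K s."""
    n = len(s)
    return sum(s[i] * K[i][j] * s[j] for i in range(n) for j in range(n))

def mmd_vector(base_kernels, s):
    """h = [tr(K_1 S), ..., tr(K_M S)]'."""
    return [trace_KS(K, s) for K in base_kernels]

def combine(base_kernels, d):
    """K = sum_m d_m K_m."""
    n = len(base_kernels[0])
    return [[sum(dm * K[i][j] for dm, K in zip(d, base_kernels))
             for j in range(n)] for i in range(n)]

s = [0.5, 0.5, -1.0]                      # n_A = 2, n_T = 1
K1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K2 = [[2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 2.0]]
d = [0.25, 0.75]

h = mmd_vector([K1, K2], s)
omega = sum(hm * dm for hm, dm in zip(h, d))      # h'd
direct = trace_KS(combine([K1, K2], d), s)        # tr(KS)
```

Because h depends only on the base kernels and the data, it can be computed once before optimising d.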

Cross-Domain Problem

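• Adaptive Multiple Kernel Learning (A-MKL)

$$\min_{\mathbf{d} \in \mathcal{D}} G(\mathbf{d}) = \frac{1}{2} \Omega^2(\mathbf{d}) + \theta \cdot J(\mathbf{d})$$

where $\Omega(\mathbf{d})$ is the MMD term and $J(\mathbf{d})$ is the structural risk functional:

$$J(\mathbf{d}) = \min_{\mathbf{w}_m, \boldsymbol{\beta}, b, \xi_i} \frac{1}{2} \left( \sum_{m=1}^{M} d_m \|\mathbf{w}_m\|^2 + \lambda \|\boldsymbol{\beta}\|^2 \right) + C \sum_{i=1}^{n} \xi_i$$

$$\text{s.t.} \quad y_i \left( \sum_{p=1}^{P} \beta_p f_p(x_i) + \sum_{m=1}^{M} d_m \mathbf{w}_m' \varphi_m(x_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0$$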
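Since the MMD term was written as Ω(d) = h'd, it can be expanded explicitly in the objective. The following is a worked substitution using only the symbols already defined, not an additional result from the slides:

```latex
\frac{1}{2}\,\Omega^2(\mathbf{d})
  = \frac{1}{2}\,(\mathbf{h}'\mathbf{d})^2
  = \frac{1}{2}\,\mathbf{d}'\mathbf{h}\mathbf{h}'\mathbf{d}
\quad\Longrightarrow\quad
G(\mathbf{d}) = \frac{1}{2}\,\mathbf{d}'\mathbf{h}\mathbf{h}'\mathbf{d} + \theta\, J(\mathbf{d})
```

The first term is a convex quadratic in d because hh' is positive semidefinite, and θ trades the domain mismatch off against the structural risk.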

Cross-Domain Problem

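• Dual form of $J(\mathbf{d})$

$$J(\mathbf{d}) = \max_{\boldsymbol{\alpha}} \ \boldsymbol{\alpha}'\mathbf{1} - \frac{1}{2} (\boldsymbol{\alpha} \circ \mathbf{y})' \left( \sum_{m=1}^{M} d_m \tilde{\mathbf{K}}_m \right) (\boldsymbol{\alpha} \circ \mathbf{y})$$

$$\text{s.t.} \quad \boldsymbol{\alpha}'\mathbf{y} = 0, \quad \mathbf{0} \le \boldsymbol{\alpha} \le C\mathbf{1}$$

• A-MKL algorithm
– Iteratively solve the linear coefficients $\mathbf{d}$ and the dual variables $\boldsymbol{\alpha}$ in the dual form of $J(\mathbf{d})$.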
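To make the dual concrete, the sketch below evaluates the dual objective for a given α and checks the two constraints. It is written in the standard SVM maximisation form, α'1 minus half the quadratic term, which is an assumption about the intended sign convention. The `kernels` argument stands in for the matrices written with a tilde on the slide, whose exact definition is not given here. No optimisation is performed.

```python
def dual_objective(alpha, y, kernels, d):
    """alpha'1 - 0.5 * (alpha o y)' (sum_m d_m K_m) (alpha o y)."""
    n = len(alpha)
    ay = [a * yi for a, yi in zip(alpha, y)]          # element-wise alpha o y
    quad = 0.0
    for i in range(n):
        for j in range(n):
            k_ij = sum(dm * K[i][j] for dm, K in zip(d, kernels))
            quad += ay[i] * k_ij * ay[j]
    return sum(alpha) - 0.5 * quad

def is_feasible(alpha, y, C, tol=1e-9):
    """Check alpha'y = 0 and 0 <= alpha <= C (up to a tolerance)."""
    balanced = abs(sum(a * yi for a, yi in zip(alpha, y))) <= tol
    boxed = all(-tol <= a <= C + tol for a in alpha)
    return balanced and boxed

# Toy problem: two samples, two base kernels, equal kernel weights.
K1 = [[1.0, 0.0], [0.0, 1.0]]
K2 = [[1.0, 0.5], [0.5, 1.0]]
alpha, y = [0.5, 0.5], [1, -1]
value = dual_objective(alpha, y, [K1, K2], d=[0.5, 0.5])
feasible = is_feasible(alpha, y, C=1.0)
```

In an alternating scheme, one step would maximise this objective over α for fixed d (a standard SVM solve with the combined kernel), and the other would update d for fixed α.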

Cross-Domain Problem

• Feature Replication (FR), H. Daumé III [6]
– Augment features

• Domain Transfer SVM (DTSVM), L. Duan [7]
– No prior information

• Adaptive SVM (A-SVM), J. Yang [8]
– $\beta_p$ is pre-defined
– $\Delta f(x)$ is modeled by SVM

Experiments

• Data set
– 195 consumer videos and 906 web videos, collected by ourselves and from the Kodak Consumer Video Benchmark Data Set
– 6 events: “wedding”, “birthday”, “picnic”, “parade”, “show” and “sports”
– Training data: 3 videos per event from consumer videos and all web videos
– Test data: The remaining consumer videos
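The split sizes implied by these numbers can be worked out directly. The arithmetic below assumes that each consumer video belongs to exactly one event, which the slide does not state; the variable names are hypothetical.

```python
EVENTS = ["wedding", "birthday", "picnic", "parade", "show", "sports"]
N_CONSUMER = 195        # consumer videos in total
N_WEB = 906             # web videos in total
TRAIN_PER_EVENT = 3     # consumer videos used for training, per event

# Assumption: every consumer video carries exactly one event label.
n_train_consumer = TRAIN_PER_EVENT * len(EVENTS)
n_train = n_train_consumer + N_WEB
n_test = N_CONSUMER - n_train_consumer
```

Under that assumption the training set is dominated by web videos, which is why the mismatch between the two domains matters.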

Experiments

• Two types of features
– Space-time (ST) feature, Laptev et al. [2]
– SIFT feature, Lowe [9]

• Four types of base kernels
– Gaussian
– Laplacian
– Inverse Square Distance
– Inverse Distance
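The kernel formulas are not legible in this transcript. The sketch below shows one common way to turn a video-to-video distance into each of the four kernel types. The functional forms and the single bandwidth parameter `gamma` are assumptions and may differ from the definitions used by the authors.

```python
import math

# Assumed functional forms; `dist` is a distance between two videos and
# `gamma` > 0 is a bandwidth parameter.

def gaussian(dist, gamma):
    return math.exp(-gamma * dist ** 2)

def laplacian(dist, gamma):
    return math.exp(-math.sqrt(gamma) * dist)

def inverse_square_distance(dist, gamma):
    return 1.0 / (gamma * dist ** 2 + 1.0)

def inverse_distance(dist, gamma):
    return 1.0 / (math.sqrt(gamma) * dist + 1.0)

KERNELS = {
    "gaussian": gaussian,
    "laplacian": laplacian,
    "inverse_square_distance": inverse_square_distance,
    "inverse_distance": inverse_distance,
}

# Kernel values for one pair of videos at distance 2.0 with gamma = 0.25.
values = {name: k(2.0, 0.25) for name, k in KERNELS.items()}
```

All four forms equal 1 at zero distance and decrease as the distance grows, so they behave as similarities.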

Experiments

• Aligned Space-Time Pyramid Matching (ASTPM) vs. Unaligned Space-Time Pyramid Matching (USTPM)
– ASTPM is better than USTPM at Level 1

[Figure: results for the aligned and unaligned methods]

Experiments

• 80 base kernels in total: 2 pyramid levels × 2 types of features × 5 kernel parameters × 4 types of kernels
• Average classifiers at each pyramid level
– SIFT features: 20 base classifiers learned by SVM
– ST features: 20 base classifiers learned by SVM
– Pre-learned classifiers $f_p$: 4 average classifiers

$$f^T(\mathbf{x}) = \sum_{p=1}^{P} \beta_p f_p(\mathbf{x}) + \sum_{m=1}^{M} d_m \mathbf{w}_m' \varphi_m(\mathbf{x}) + b$$
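The counts on this slide can be reproduced by enumerating the kernel configurations and grouping them by pyramid level and feature type. This is a minimal sketch: the level numbering (0 and 1), the kernel-type labels, and the dummy decision values are assumptions made so the example runs.

```python
from itertools import product

LEVELS = [0, 1]                                   # assumed numbering of the 2 levels
FEATURES = ["SIFT", "ST"]
KERNEL_TYPES = ["gaussian", "laplacian", "isd", "id"]
GAMMA_INDICES = range(5)                          # 5 kernel parameters, values not given

configs = list(product(LEVELS, FEATURES, KERNEL_TYPES, GAMMA_INDICES))

def average_classifiers(base_scores):
    """Average base decision values within each (level, feature) group.

    `base_scores` maps a configuration tuple to the decision value of the
    SVM trained with that base kernel, for one test video (hypothetical input).
    """
    groups = {}
    for (level, feature, _kernel, _gamma), score in base_scores.items():
        groups.setdefault((level, feature), []).append(score)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}

# Dummy decision values so the sketch runs: each base SVM outputs its level.
scores = {cfg: float(cfg[0]) for cfg in configs}
prelearned = average_classifiers(scores)
```

Each of the four averaged outputs would then play the role of one pre-learned classifier in the target decision function.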

Experiments

• Comparisons of cross-domain learning methods
– (a) SIFT features
– (b) ST features
– (c) SIFT features and ST features
– “parade”: 75.7% (A-MKL) vs. 62.2% (FR)
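To put the “parade” numbers in relative terms, the sketch below applies the usual definition of relative improvement, the gain divided by the baseline score. The slides do not define the term, so this definition is an assumption, and `relative_improvement` is a hypothetical helper.

```python
def relative_improvement(ours, baseline):
    """Relative improvement in percent: (ours - baseline) / baseline * 100."""
    return (ours - baseline) / baseline * 100.0

# "parade": 75.7% for A-MKL versus 62.2% for FR.
gain = relative_improvement(75.7, 62.2)
```

The absolute gap of 13.5 percentage points corresponds to a relative gain of roughly 21.7% under this definition.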

Experiments

• Comparisons of cross-domain learning methods

• Relative improvements of A-MKL over each method
– SVM_T: 36.9%
– SVM_AT: 8.6%
– Feature Replication (FR) [6]: 7.6%
– Adaptive SVM (A-SVM) [8]: 49.6%
– Domain Transfer SVM (DTSVM) [7]: 9.9%

• MKL-based methods
– Better fuse SIFT features and ST features
– Handle noise in the loose labels

Conclusion

• We propose a new event recognition framework for consumer videos by leveraging a large number of loosely labeled web videos.

• We develop a new aligned space-time pyramid matching method.

• We present a new cross-domain learning method, A-MKL, which handles the mismatch between the data distributions of the consumer video domain and the web video domain.

References

[1] D. Xu and S.-F. Chang. Video event recognition using kernel methods with multi-level temporal alignment. T-PAMI, 30(11):1985–1997, 2008.
[2] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
[3] Y. Rubner, C. Tomasi, and L. J. Guibas. The Earth mover’s distance as a metric for image retrieval. IJCV, 40(2):99–121, 2000.
[4] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. Smola. Integrating structured biological data by kernel maximum mean discrepancy. In ISMB, 2006.

References

[5] F. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality and the SMO algorithm. In ICML, 2004.
[6] H. Daumé III. Frustratingly easy domain adaptation. In ACL, 2007.
[7] L. Duan, I. W. Tsang, D. Xu, and S. J. Maybank. Domain transfer SVM for video concept detection. In CVPR, 2009.
[8] J. Yang, R. Yan, and A. G. Hauptmann. Cross-domain video concept detection using adaptive SVMs. In ACM MM, 2007.
[9] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.

Thank you!