Visual Event Recognition in Videos by Learning from Web Data Lixin Duan, Dong Xu, Ivor Tsang, Jiebo...

Visual Event Recognition in Videos by Learning from Web Data

Lixin Duan†, Dong Xu†, Ivor Tsang†, Jiebo Luo¶

† Nanyang Technological University, Singapore

¶ Kodak Research Labs, Rochester, NY, USA

Outline

• Overview of the Event Recognition System

• Similarity between Videos

– Aligned Space-Time Pyramid Matching

• Cross-Domain Problem

– Adaptive Multiple Kernel Learning

• Experiments

• Conclusion

Overview

• GOAL: Recognize consumer videos

• Large intra-class variability; limited labeled videos

[Figure: example consumer videos for the events Sports, Picnic, and Wedding]

• GOAL: Recognize consumer videos by leveraging a large number of loosely labeled web videos (e.g., from YouTube)


Overview

[Figure: consumer videos alongside a large number of web videos]

Overview

• Flowchart of the system

[Figure: flowchart with the components Video Database, Test video, Classifier, and Output]

• Pyramid matching methods

– Temporally aligned pyramid matching, D. Xu and S.-F. Chang [1]

– Unaligned space-time pyramid matching, I. Laptev [2]

Similarity between Videos

[Figure: pyramid partitions along the time axis, the space axes, and the space-time axes]


• Aligned Space-Time Pyramid Matching

– Each video is divided into $R$ non-overlapped space-time volumes

– Greater variability

• Two-step approach

– Distances between space-time volumes: solved by existing methods such as the bag-of-words model, I. Laptev [2]
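The division into space-time volumes can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes that at pyramid level `level` each axis (time, height, width) is cut into `2**level` equal parts, which gives `8**level` volumes. The function name `space_time_volumes` and the clip size are hypothetical.

```python
from itertools import product

def space_time_volumes(n_frames, height, width, level):
    """Return (t, y, x) index ranges for the volumes at one pyramid level.

    Assumption: each of the three axes is cut into 2**level equal parts,
    so a level holds 8**level non-overlapped volumes.
    """
    parts = 2 ** level

    def cuts(size):
        # Integer boundaries that cover [0, size) without overlap.
        edges = [size * k // parts for k in range(parts + 1)]
        return list(zip(edges[:-1], edges[1:]))

    return list(product(cuts(n_frames), cuts(height), cuts(width)))

volumes = space_time_volumes(n_frames=120, height=240, width=320, level=1)
print(len(volumes))   # 8 volumes at level 1
print(volumes[0])     # ((0, 60), (0, 120), (0, 160))
```

Each returned triple gives half-open ranges over frames, rows, and columns; a feature histogram would then be built per volume.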


• Aligned Space-Time Pyramid Matching

– Level 1

[Figure: distances between the space-time volumes of videos $V_i$ and $V_j$ at Level 1]

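• Integer-flow Earth Mover’s Distance (EMD), Y. Rubner [3]

$$\hat{F}_{rc} = \arg\min_{F_{rc} \in \{0,1\}} \sum_{r=1}^{R} \sum_{c=1}^{R} F_{rc} D_{rc}$$

$$\text{s.t.} \quad \sum_{c=1}^{R} F_{rc} = 1, \ \forall r; \qquad \sum_{r=1}^{R} F_{rc} = 1, \ \forall c.$$

– Distance between two videos:

$$D(V_i, V_j) = \frac{\sum_{r=1}^{R} \sum_{c=1}^{R} \hat{F}_{rc} D_{rc}}{\sum_{r=1}^{R} \sum_{c=1}^{R} \hat{F}_{rc}}$$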
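Because every $F_{rc}$ is 0 or 1 and each row and column sums to 1, the flow matrix is a permutation matrix, so the integer-flow EMD is a linear assignment problem. The sketch below solves it by brute force over permutations, which is only practical for small $R$; the function name `aligned_distance` and the toy matrix are hypothetical, and this is not the authors' implementation.

```python
from itertools import permutations

def aligned_distance(D):
    """Integer-flow EMD between two videos with R volumes each.

    D[r][c] is the distance between volume r of V_i and volume c of V_j.
    Every permutation of the columns is one feasible flow matrix F.
    """
    R = len(D)
    best_cost, best_match = float("inf"), None
    for perm in permutations(range(R)):
        cost = sum(D[r][perm[r]] for r in range(R))
        if cost < best_cost:
            best_cost, best_match = cost, perm
    # The denominator (sum of all F_rc) equals R for a permutation matrix.
    return best_cost / R, best_match

D = [[0.9, 0.1, 0.8],
     [0.2, 0.7, 0.9],
     [0.8, 0.9, 0.3]]
distance, match = aligned_distance(D)
print(match)      # (1, 0, 2): volume 0 matches 1, volume 1 matches 0, volume 2 matches 2
print(distance)   # about 0.2
```

For larger $R$, a polynomial-time assignment solver such as `scipy.optimize.linear_sum_assignment` would be the usual replacement for the brute-force loop.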



Cross-Domain Problem

• Data distribution mismatch between consumer videos and web videos

– Consumer videos: naturally captured

– Web videos: edited; selected

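• Maximum Mean Discrepancy (MMD), K. M. Borgwardt [4]

$$DIST_k(D^A, D^T) = \left\| \frac{1}{n_A} \sum_{i=1}^{n_A} \varphi(x_i^A) - \frac{1}{n_T} \sum_{i=1}^{n_T} \varphi(x_i^T) \right\|_{\mathcal{H}}$$

$$\Rightarrow \ DIST_k^2(D^A, D^T) = tr(KS)$$

where $K$ is the kernel matrix over the samples of both domains ($A$ first, then $T$), $S = \mathbf{s}\mathbf{s}'$, and $\mathbf{s} = [1/n_A, \dots, 1/n_A, -1/n_T, \dots, -1/n_T]'$.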
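The trace form of the squared MMD can be checked numerically. The sketch below is an illustration under stated assumptions: it uses a linear kernel (so the MMD reduces to the distance between the two sample means), toy two-dimensional features, and the identity $tr(KS) = \mathbf{s}'K\mathbf{s}$ for $S = \mathbf{s}\mathbf{s}'$. The names `linear_kernel` and `mmd_squared` are hypothetical.

```python
def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))

def mmd_squared(X_A, X_T, kernel=linear_kernel):
    """Squared MMD computed as tr(KS) = s'Ks with S = ss'."""
    n_A, n_T = len(X_A), len(X_T)
    X = X_A + X_T                                  # domain A samples first
    s = [1.0 / n_A] * n_A + [-1.0 / n_T] * n_T
    return sum(s[i] * s[j] * kernel(X[i], X[j])
               for i in range(len(X)) for j in range(len(X)))

X_A = [[0.0, 0.0], [2.0, 0.0]]   # toy features from domain A, mean (1, 0)
X_T = [[1.0, 3.0], [1.0, 5.0]]   # toy features from domain T, mean (1, 4)
print(mmd_squared(X_A, X_T))      # 16.0, the squared distance between the means
```

With a nonlinear kernel the same function measures the distance between the domain means in the induced feature space.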


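• Suppose there are $P$ pre-learned classifiers $\{f_p(x)\}_{p=1}^{P}$

• Each $f_p(x)$ is learned by SVM with the labeled training data from both domains

• Proposed target decision function

$$f^T(x) = \sum_{p=1}^{P} \beta_p f_p(x) + \Delta f(x)$$

where $\beta_p$ is the linear combination coefficient and $\Delta f(x)$ is the perturbation function. The first term, $\sum_{p} \beta_p f_p(x)$, carries the prior information.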


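• Motivated by Multiple Kernel Learning (MKL) (F. Bach [5]), the perturbation function is modeled as

$$\Delta f(x) = \sum_{m=1}^{M} d_m \mathbf{w}_m' \varphi_m(x) + b$$

where $d_m \ge 0$ and $\sum_{m=1}^{M} d_m = 1$.

• MKL: $k = \sum_{m=1}^{M} d_m k_m$

• MMD:

$$\Omega(\mathbf{d}) := DIST_k^2(D^A, D^T) = tr(KS) = \mathbf{h}'\mathbf{d}$$

where $\mathbf{h} = [tr(K_1 S), \dots, tr(K_M S)]'$ and $K = \sum_{m=1}^{M} d_m K_m$.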
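Since the trace is linear in the kernel matrix, the MMD of a combined kernel is a weighted sum of the per-kernel values, $tr(KS) = \mathbf{h}'\mathbf{d}$ with $h_m = tr(K_m S)$. The following sketch verifies this on toy data. The two Gaussian base kernels, the sample values, and the coefficients `d` are hypothetical choices made only for the check.

```python
import math

def trace_KS(K, s):
    """tr(K S) with S = ss' equals s'Ks."""
    n = len(s)
    return sum(s[i] * s[j] * K[i][j] for i in range(n) for j in range(n))

def gram(X, kernel):
    return [[kernel(x, z) for z in X] for x in X]

def sq_dist(x, z):
    return sum((a - b) ** 2 for a, b in zip(x, z))

# Two hypothetical base kernels (Gaussian with different bandwidths).
kernels = [lambda x, z: math.exp(-0.1 * sq_dist(x, z)),
           lambda x, z: math.exp(-1.0 * sq_dist(x, z))]

X = [[0.0], [1.0], [3.0], [4.0]]   # first two samples: domain A, last two: domain T
s = [0.5, 0.5, -0.5, -0.5]
d = [0.3, 0.7]                      # kernel coefficients, non-negative and summing to 1

grams = [gram(X, k) for k in kernels]
h = [trace_KS(K_m, s) for K_m in grams]
omega = sum(h_m * d_m for h_m, d_m in zip(h, d))        # h'd

# Same value computed from the combined kernel K = sum_m d_m K_m.
K = [[sum(d_m * K_m[i][j] for d_m, K_m in zip(d, grams)) for j in range(4)]
     for i in range(4)]
print(abs(omega - trace_KS(K, s)) < 1e-12)   # True
```

The practical consequence is that $\mathbf{h}$ can be computed once from the base kernels and reused whenever $\mathbf{d}$ changes.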


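• Adaptive Multiple Kernel Learning (A-MKL)

$$\min_{\mathbf{d} \in \mathcal{D}} G(\mathbf{d}) = \frac{1}{2} \Omega^2(\mathbf{d}) + \theta \cdot J(\mathbf{d})$$

where

$$J(\mathbf{d}) = \min_{\mathbf{w}_m, \boldsymbol{\beta}, b, \xi_i} \ \frac{1}{2} \left( \sum_{m=1}^{M} d_m \|\mathbf{w}_m\|^2 + \lambda \|\boldsymbol{\beta}\|^2 \right) + C \sum_{i=1}^{n} \xi_i$$

$$\text{s.t.} \quad y_i \left( \sum_{p=1}^{P} \beta_p f_p(x_i) + \sum_{m=1}^{M} d_m \mathbf{w}_m' \varphi_m(x_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0$$

– $\Omega^2(\mathbf{d})$: MMD term

– $J(\mathbf{d})$: structural risk functional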


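• Dual form of $J(\mathbf{d})$

$$J(\mathbf{d}) = \max_{\boldsymbol{\alpha}} \ \boldsymbol{\alpha}'\mathbf{1} - \frac{1}{2} (\boldsymbol{\alpha} \circ \mathbf{y})' \left( \sum_{m=1}^{M} d_m \tilde{\mathbf{K}}_m \right) (\boldsymbol{\alpha} \circ \mathbf{y})$$

$$\text{s.t.} \quad \boldsymbol{\alpha}'\mathbf{y} = 0, \quad \mathbf{0} \le \boldsymbol{\alpha} \le C\mathbf{1}$$

• A-MKL algorithm

– Iteratively solve for the linear coefficients $\mathbf{d}$ and the dual variables $\boldsymbol{\alpha}$ in the dual form of $J(\mathbf{d})$.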
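One way to see how the alternating scheme can update $\mathbf{d}$ is to write out the gradient of the objective for fixed dual variables. The derivation below is a sketch, not necessarily the exact update rule used by the authors. It takes $G(\mathbf{d}) = \frac{1}{2}\Omega^2(\mathbf{d}) + \theta J(\mathbf{d})$ with $\Omega(\mathbf{d}) = \mathbf{h}'\mathbf{d}$, treats $J(\mathbf{d})$ as the maximum over $\boldsymbol{\alpha}$ of the standard SVM dual objective $\boldsymbol{\alpha}'\mathbf{1} - \frac{1}{2}(\boldsymbol{\alpha} \circ \mathbf{y})'(\sum_m d_m \tilde{\mathbf{K}}_m)(\boldsymbol{\alpha} \circ \mathbf{y})$, and assumes that the matrices $\tilde{\mathbf{K}}_m$ do not depend on $\mathbf{d}$. Here $\hat{\boldsymbol{\alpha}}$ denotes the maximizer for the current $\mathbf{d}$.

$$\frac{\partial}{\partial d_m} \left( \frac{1}{2} \Omega^2(\mathbf{d}) \right) = \Omega(\mathbf{d}) \, h_m = (\mathbf{h}'\mathbf{d}) \, h_m$$

$$\frac{\partial J}{\partial d_m} = -\frac{1}{2} (\hat{\boldsymbol{\alpha}} \circ \mathbf{y})' \tilde{\mathbf{K}}_m (\hat{\boldsymbol{\alpha}} \circ \mathbf{y})$$

$$\frac{\partial G}{\partial d_m} = (\mathbf{h}'\mathbf{d}) \, h_m - \frac{\theta}{2} (\hat{\boldsymbol{\alpha}} \circ \mathbf{y})' \tilde{\mathbf{K}}_m (\hat{\boldsymbol{\alpha}} \circ \mathbf{y})$$

Under these assumptions, one iteration would solve the SVM dual for $\hat{\boldsymbol{\alpha}}$ with $\mathbf{d}$ fixed, then take a descent step on $\mathbf{d}$ along this gradient while keeping $\mathbf{d}$ inside $\mathcal{D}$.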


• Feature Replication (FR), H. Daumé III [6]

– Augment features

• Domain Transfer SVM (DTSVM), L. Duan [7]

– No prior information

• Adaptive SVM (A-SVM), J. Yang [8]

– $\beta_p$ is pre-defined

– $\Delta f(x)$ is modeled by SVM

Experiments

• Data set

– 195 consumer videos and 906 web videos, collected by ourselves and from the Kodak Consumer Video Benchmark Data Set

– 6 events: “wedding”, “birthday”, “picnic”, “parade”, “show” and “sports”

– Training data: 3 videos per event from the consumer videos, plus all web videos

– Test data: the remaining consumer videos

Experiments

• Two types of features

– Space-time (ST) feature, Laptev et al. [2]

– SIFT feature, Lowe [9]

• Four types of base kernels

– Gaussian

– Laplacian

– Inverse Square Distance (ISD)

– Inverse Distance (ID)
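The kernel formulas did not survive in this transcript, so the following Python sketch shows one plausible parameterization of the four base kernels as functions of a video distance `D` (for example $D(V_i, V_j)$ from the pyramid matching). The exact forms, the use of `sqrt(gamma)` in the Laplacian and ID kernels, and the parameter name `gamma` are assumptions for illustration and may differ from the ones used by the authors.

```python
import math

def gaussian(D, gamma):
    # Assumed form: exp(-gamma * D^2)
    return math.exp(-gamma * D ** 2)

def laplacian(D, gamma):
    # Assumed form: exp(-sqrt(gamma) * D)
    return math.exp(-math.sqrt(gamma) * D)

def inverse_square_distance(D, gamma):
    # Assumed form: 1 / (gamma * D^2 + 1)
    return 1.0 / (gamma * D ** 2 + 1.0)

def inverse_distance(D, gamma):
    # Assumed form: 1 / (sqrt(gamma) * D + 1)
    return 1.0 / (math.sqrt(gamma) * D + 1.0)

for kernel in (gaussian, laplacian, inverse_square_distance, inverse_distance):
    # Each kernel equals 1 at distance 0 and decreases as the distance grows.
    print(kernel.__name__, kernel(0.0, 0.5), kernel(2.0, 0.5))
```

All four map a distance of zero to a similarity of one and decay monotonically, which is what lets any pyramid-matching distance be plugged into an SVM.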

Experiments

• Aligned Space-Time Pyramid Matching (ASTPM) vs. Unaligned Space-Time Pyramid Matching (USTPM)

– ASTPM is better than USTPM at Level 1

[Figure: results for aligned vs. unaligned matching]

Experiments

• 80 base kernels in total: 2 pyramid levels, 2 types of features, 5 kernel parameters and 4 types of kernels

• Average classifiers at each pyramid level

– SIFT features: 20 base classifiers learned by SVM

– ST features: 20 base classifiers learned by SVM

– Pre-learned classifiers $f_p(x)$: 4 average classifiers (2 types of features × 2 pyramid levels)

$$f^T(x) = \sum_{p=1}^{P} \beta_p f_p(x) + \sum_{m=1}^{M} d_m \mathbf{w}_m' \varphi_m(x) + b$$
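The counts on this slide can be reproduced by enumerating the kernel configurations. In the sketch below the level indices `0` and `1`, the placeholder parameter ids, and the helper `average_classifier` are assumptions for illustration; only the counts (80 base kernels, 4 groups of 20) come from the slide.

```python
from itertools import product
from collections import defaultdict

features = ["SIFT", "ST"]
levels = [0, 1]                       # assumed indices for the 2 pyramid levels
kernel_types = ["Gaussian", "Laplacian", "ISD", "ID"]
param_ids = range(5)                  # placeholders for the 5 kernel parameters

base_kernels = list(product(features, levels, kernel_types, param_ids))
print(len(base_kernels))              # 80

# One group per (feature, level) pair; each group feeds one average classifier.
groups = defaultdict(list)
for feature, level, kernel_type, param_id in base_kernels:
    groups[(feature, level)].append((kernel_type, param_id))
print(len(groups), {len(members) for members in groups.values()})   # 4 {20}

def average_classifier(decision_values):
    """Average the decision values of the base classifiers in one group."""
    return sum(decision_values) / len(decision_values)
```

With this grouping, $P = 4$ pre-learned classifiers and $M = 80$ base kernels enter the target decision function.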

Experiments

• Comparisons of cross-domain learning methods

– (a) SIFT features

– (b) ST features

– (c) SIFT features and ST features

– “parade”: 75.7% (A-MKL) vs. 62.2% (FR)

Experiments

• Comparisons of cross-domain learning methods

• Relative improvements of A-MKL over each baseline

– SVM_T: 36.9%

– SVM_AT: 8.6%

– Feature Replication (FR) [6]: 7.6%

– Adaptive SVM (A-SVM) [8]: 49.6%

– Domain Transfer SVM (DTSVM) [7]: 9.9%

• MKL-based methods

– Better fuse SIFT features and ST features

– Handle noise in the loose labels
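For readers who want to relate absolute scores to relative improvements, the sketch below assumes the usual definition, (new − baseline) / baseline. It is applied to the two “parade” scores quoted on the earlier slide (75.7% for A-MKL, 62.2% for FR); the resulting per-event figure is plain arithmetic on those two numbers and is not a result reported in the slides. The function name `relative_improvement` is hypothetical.

```python
def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

print(round(relative_improvement(75.7, 62.2), 1))   # 21.7
```

The overall percentages listed above would come from the same formula applied to scores aggregated over all six events, which is why they differ from this single-event value.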

Conclusion

• We propose a new event recognition framework for consumer videos by leveraging a large number of loosely labeled web videos.

• We develop a new aligned space-time pyramid matching method.

• We present a new cross-domain learning method A-MKL which handles the mismatch between the data distributions of the consumer video domain and the web video domain.

References

[1] D. Xu and S.-F. Chang. Video event recognition using kernel methods with multi-level temporal alignment. T-PAMI, 30(11):1985–1997, 2008.

[2] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.

[3] Y. Rubner, C. Tomasi, and L. J. Guibas. The Earth mover’s distance as a metric for image retrieval. IJCV, 40(2):99–121, 2000.

[4] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. Smola. Integrating structured biological data by kernel maximum mean discrepancy. In ISMB, 2006.


[5] F. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality and the SMO algorithm. In ICML, 2004.

[6] H. Daumé III. Frustratingly easy domain adaptation. In ACL, 2007.

[7] L. Duan, I. W. Tsang, D. Xu, and S. J. Maybank. Domain transfer SVM for video concept detection. In CVPR, 2009.

[8] J. Yang, R. Yan, and A. G. Hauptmann. Cross-domain video concept detection using adaptive SVMs. In ACM MM, 2007.

[9] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.

Thank you!
