Visual Event Recognition in Videos by Learning from Web Data
Lixin Duan†, Dong Xu†, Ivor Tsang†, Jiebo Luo¶
† Nanyang Technological University, Singapore
¶ Kodak Research Labs, Rochester, NY, USA
Outline
• Overview of the Event Recognition System
• Similarity between Videos
– Aligned Space-Time Pyramid Matching
• Cross-Domain Problem
– Adaptive Multiple Kernel Learning
• Experiments
• Conclusion
Overview
• GOAL: Recognize consumer videos by leveraging a large number of loosely labeled web videos (e.g., from YouTube)
• Large intra-class variability; limited labeled videos
[Figure: example videos for the events "Sports", "Picnic" and "Wedding"; consumer videos shown alongside a large number of web videos]
• Flowchart of the system
[Figure: system flowchart (video database, test video, classifier, output)]
Similarity between Videos
• Pyramid matching methods
– Temporally aligned pyramid matching, D. Xu and S.-F. Chang [1]
– Unaligned space-time pyramid matching, I. Laptev [2]
[Figure: pyramid partitions along the time axis, the space axes, and the space-time axes]
Similarity between Videos
• Aligned Space-Time Pyramid Matching
– Each video is divided into non-overlapping space-time volumes
– Greater variability
• Two-step approach
– Distances between space-time volumes: solved by existing methods such as the bag-of-words model, I. Laptev [2]
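The division into space-time volumes can be sketched as follows. This is a minimal illustration, assuming each of the width, height and time axes is cut into 2**level equal parts (so 8**level volumes per level); that partition rule and the function name `split_into_volumes` are assumptions made here, not taken from the slides.

```python
from itertools import product


def split_into_volumes(width, height, frames, level):
    """Return the (x, y, t) index ranges of the space-time volumes at one level.

    Assumption (not from the slides): every axis is cut into 2**level
    equal parts, which gives 8**level non-overlapping volumes.
    """
    parts = 2 ** level

    def cuts(size):
        # Half-open integer ranges that tile [0, size) without overlap.
        return [(i * size // parts, (i + 1) * size // parts) for i in range(parts)]

    return [
        {"x": xr, "y": yr, "t": tr}
        for xr, yr, tr in product(cuts(width), cuts(height), cuts(frames))
    ]


volumes = split_into_volumes(width=320, height=240, frames=100, level=1)
print(len(volumes))  # number of volumes at this level
print(volumes[0])    # index ranges of the first volume
```

Each volume would then be described on its own (for example with a bag-of-words histogram) before volumes of two videos are compared.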
Similarity between Videos
• Aligned Space-Time Pyramid Matching
– Level 1
[Figure: space-time volumes of videos V_i and V_j at Level 1, with the distance computed between them]
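• Integer-flow Earth Mover's Distance (EMD), Y. Rubner [3]
$$\hat{F}_{rc} = \arg\min_{F_{rc} \in \{0,1\}} \sum_{r=1}^{R} \sum_{c=1}^{R} F_{rc} D_{rc} \quad \text{s.t.} \quad \sum_{c=1}^{R} F_{rc} = 1, \ \forall r; \quad \sum_{r=1}^{R} F_{rc} = 1, \ \forall c.$$
$$D(V_i, V_j) = \frac{\sum_{r=1}^{R} \sum_{c=1}^{R} \hat{F}_{rc} D_{rc}}{\sum_{r=1}^{R} \sum_{c=1}^{R} \hat{F}_{rc}}$$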
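With binary flows and unit row and column sums, the integer-flow EMD reduces to a one-to-one assignment between the R volumes of the two videos. The sketch below is only an illustration: it solves the assignment by brute force over permutations, which is feasible only for small R (a practical implementation would more likely use the Hungarian algorithm). The function name and the example distance matrix are made up for this example.

```python
from itertools import permutations


def integer_flow_emd(D):
    """Distance between two videos from an R x R matrix of volume distances.

    D[r][c] is the distance between volume r of one video and volume c of
    the other. Returns (distance, matching), where matching[r] is the
    volume of the second video assigned to volume r of the first.
    """
    R = len(D)
    best_cost, best_perm = float("inf"), None
    # Each permutation is a binary flow F with every row and column summing to 1.
    for perm in permutations(range(R)):
        cost = sum(D[r][perm[r]] for r in range(R))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    total_flow = R  # sum of all F_rc for a one-to-one matching
    return best_cost / total_flow, best_perm


# Hypothetical pairwise distances between the volumes of two videos.
D = [
    [0.9, 0.1, 0.8],
    [0.2, 0.7, 0.9],
    [0.8, 0.9, 0.3],
]
distance, matching = integer_flow_emd(D)
print(distance, matching)
```

Because the optimal flow is a permutation, the denominator of D(V_i, V_j) is simply R, so the distance is the average cost of the best matching.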
Cross-Domain Problem
• Data distribution mismatch between consumer videos and web videos
– Consumer videos: Naturally captured
– Web videos: Edited; Selected
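• Maximum Mean Discrepancy (MMD), K. M. Borgwardt [4]
$$\mathrm{DIST}_k(D^A, D^T) = \left\| \frac{1}{n_A} \sum_{i=1}^{n_A} \varphi(x_i^A) - \frac{1}{n_T} \sum_{i=1}^{n_T} \varphi(x_i^T) \right\|_{\mathcal{H}}$$
$$\Rightarrow \mathrm{DIST}_k^2(D^A, D^T) = \mathrm{tr}(KS)$$
where $K$ is the kernel matrix over the samples of both domains, $S = ss'$, and $s$ has entries $1/n_A$ for the $n_A$ samples of $D^A$ and $-1/n_T$ for the $n_T$ samples of $D^T$.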
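The trace form of the squared MMD can be checked numerically. The sketch below assumes a Gaussian kernel on small feature vectors and uses the identity tr(KS) = s'Ks with S = ss', where s holds 1/n_A for samples of the first domain and -1/n_T for samples of the second; the kernel choice, function names and sample values are illustrative assumptions.

```python
import math


def gaussian_kernel(x, y, gamma=1.0):
    """Gaussian kernel between two feature vectors (an illustrative choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(samples_a, samples_t, kernel=gaussian_kernel):
    """Squared MMD between two sample sets, computed as tr(KS) = s' K s."""
    n_a, n_t = len(samples_a), len(samples_t)
    points = list(samples_a) + list(samples_t)
    # s weights each sample by +1/n_A or -1/n_T depending on its domain.
    s = [1.0 / n_a] * n_a + [-1.0 / n_t] * n_t
    n = n_a + n_t
    return sum(
        s[i] * s[j] * kernel(points[i], points[j])
        for i in range(n)
        for j in range(n)
    )


# Made-up 2-D samples standing in for the two video domains.
domain_a = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1)]
domain_t = [(1.0, 1.0), (0.9, 1.2)]
print(mmd_squared(domain_a, domain_a))  # identical sets: close to zero
print(mmd_squared(domain_a, domain_t))  # shifted sets: clearly positive
```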
Cross-Domain Problem
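• Suppose there are $P$ pre-learned classifiers $\{f_p(x)\}_{p=1}^{P}$
• Each $f_p$ is learned by SVM with the labeled training data from both domains
• Proposed target decision function
$$f^T(x) = \sum_{p=1}^{P} \beta_p f_p(x) + \Delta f(x)$$
where $\beta_p$ is the linear combination coefficient and $\Delta f(x)$ is the perturbation function. The pre-learned classifiers in the first term carry the prior information.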
Cross-Domain Problem
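• Motivated by Multiple Kernel Learning (MKL) (F. Bach [5]), the perturbation function is
$$\Delta f(x) = \sum_{m=1}^{M} d_m \mathbf{w}_m' \varphi_m(\mathbf{x}) + b$$
where $d_m$ is the combination coefficient of the $m$-th base kernel.
• MKL: the kernel is a linear combination of $M$ base kernels, $\mathbf{K} = \sum_{m=1}^{M} d_m \mathbf{K}_m$
• MMD
$$\Omega(\mathbf{d}) := \mathrm{DIST}_k^2(D^A, D^T) = \mathrm{tr}(KS) = \mathbf{h}'\mathbf{d}$$
where $\mathbf{h} = [\mathrm{tr}(\mathbf{K}_1 S), \ldots, \mathrm{tr}(\mathbf{K}_M S)]'$ and $\mathbf{d} = [d_1, \ldots, d_M]'$.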
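The linear form of the MMD term follows from the linearity of the trace. The short derivation below assumes, as is standard in MKL, that the combined kernel matrix is $\mathbf{K} = \sum_m d_m \mathbf{K}_m$; the component form of $\mathbf{h}$ is what that assumption implies.

```latex
\begin{aligned}
\Omega(\mathbf{d})
  &= \mathrm{tr}(\mathbf{K} S)
   = \mathrm{tr}\!\Big( \sum_{m=1}^{M} d_m \mathbf{K}_m S \Big) \\
  &= \sum_{m=1}^{M} d_m \, \mathrm{tr}(\mathbf{K}_m S)
   = \mathbf{h}'\mathbf{d},
\qquad
\mathbf{h} = \big[ \mathrm{tr}(\mathbf{K}_1 S), \ldots, \mathrm{tr}(\mathbf{K}_M S) \big]'.
\end{aligned}
```

Since $\mathbf{h}$ depends only on the base kernels and the domain sizes, it can be computed once before the coefficients $\mathbf{d}$ are optimized.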
Cross-Domain Problem
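• Adaptive Multiple Kernel Learning (A-MKL)
$$\min_{\mathbf{d} \in \mathcal{D}} G(\mathbf{d}) = \frac{1}{2} \Omega^2(\mathbf{d}) + \theta \cdot J(\mathbf{d})$$
where
$$J(\mathbf{d}) = \min_{\mathbf{w}_m, \boldsymbol{\beta}, b, \xi_i} \frac{1}{2} \left( \sum_{m=1}^{M} d_m \|\mathbf{w}_m\|^2 + \lambda \|\boldsymbol{\beta}\|^2 \right) + C \sum_{i=1}^{n} \xi_i$$
$$\text{s.t.} \quad y_i \left( \sum_{p=1}^{P} \beta_p f_p(\mathbf{x}_i) + \sum_{m=1}^{M} d_m \mathbf{w}_m' \varphi_m(\mathbf{x}_i) + b \right) \geq 1 - \xi_i, \quad \xi_i \geq 0$$
Here $\Omega(\mathbf{d})$ is the MMD term and $J(\mathbf{d})$ is the structural risk functional.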
Cross-Domain Problem
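• Dual form of $J(\mathbf{d})$
$$J(\mathbf{d}) = \max_{\boldsymbol{\alpha}} \ \boldsymbol{\alpha}'\mathbf{1} - \frac{1}{2} (\boldsymbol{\alpha} \circ \mathbf{y})' \left( \sum_{m=1}^{M} d_m \tilde{\mathbf{K}}_m \right) (\boldsymbol{\alpha} \circ \mathbf{y})$$
$$\text{s.t.} \quad \boldsymbol{\alpha}'\mathbf{y} = 0, \quad \mathbf{0} \leq \boldsymbol{\alpha} \leq C\mathbf{1}$$
• A-MKL algorithm
– Iteratively solve for the linear coefficients $\mathbf{d}$ and the dual variables $\boldsymbol{\alpha}$ in the dual form of $J(\mathbf{d})$.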
Cross-Domain Problem
• Feature Replication (FR), H. Daumé III [6]
– Augment features
• Domain Transfer SVM (DTSVM), L. Duan [7]
– No prior information
• Adaptive SVM (A-SVM), J. Yang [8]
– $\beta_p$ is pre-defined
– $\Delta f(x)$ is modeled by SVM
Experiments
• Data set
– 195 consumer videos and 906 web videos, collected by ourselves and from the Kodak Consumer Video Benchmark Data Set
– 6 events: “wedding”, “birthday”, “picnic”, “parade”, “show” and “sports”
– Training data: 3 videos per event from consumer videos and all web videos
– Test data: the remaining consumer videos
Experiments
• Two types of features
– Space-time (ST) feature, Laptev et al. [2]
– SIFT feature, Lowe [9]
• Four types of base kernels
– Gaussian
– Laplacian
– Inverse Square Distance
– Inverse Distance
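The four kernel types can be sketched as functions of a distance D between two videos and a kernel parameter gamma. The formulas below are commonly used forms for these kernel names and are an assumption here, since the kernel definitions did not survive in this transcript; the function names are illustrative.

```python
import math


# All four forms below are assumed (common definitions), not taken from the slides.
def gaussian(dist, gamma):
    return math.exp(-gamma * dist ** 2)


def laplacian(dist, gamma):
    return math.exp(-math.sqrt(gamma) * dist)


def inverse_square_distance(dist, gamma):
    return 1.0 / (gamma * dist ** 2 + 1.0)


def inverse_distance(dist, gamma):
    return 1.0 / (math.sqrt(gamma) * dist + 1.0)


BASE_KERNELS = {
    "gaussian": gaussian,
    "laplacian": laplacian,
    "inverse_square_distance": inverse_square_distance,
    "inverse_distance": inverse_distance,
}

# Evaluate every kernel at one hypothetical distance and parameter value.
for name, kernel in BASE_KERNELS.items():
    print(name, round(kernel(0.5, gamma=2.0), 4))
```

Under these forms every kernel equals 1 at zero distance and decreases as the distance grows, so a smaller pyramid-matching distance means a higher similarity.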
Experiments
• Aligned Space-Time Pyramid Matching (ASTPM) vs. Unaligned Space-Time Pyramid Matching (USTPM)
– ASTPM is better than USTPM at Level 1
[Figure: results shown in two panels, "Aligned" and "Unaligned"]
Experiments
• 80 base kernels in total: 2 pyramid levels, 2 types of features, 5 kernel parameters and 4 types of kernels
• Average classifiers at each pyramid level
– SIFT features: 20 base classifiers learned by SVM
– ST features: 20 base classifiers learned by SVM
– Pre-learned classifiers $f_p$: 4 average classifiers
$$f^T(\mathbf{x}) = \sum_{p=1}^{P} \beta_p f_p(\mathbf{x}) + \sum_{m=1}^{M} d_m \mathbf{w}_m' \varphi_m(\mathbf{x}) + b$$
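The kernel count can be reproduced by enumerating the four factors. In the sketch below the level labels, the kernel type names and the five parameter values are hypothetical placeholders; only the counts (2, 2, 4 and 5) come from the slide. Grouping by level and feature type gives 4 groups of 20, one per average classifier.

```python
from itertools import product

LEVELS = [0, 1]                              # 2 pyramid levels (labels are illustrative)
FEATURES = ["SIFT", "ST"]                    # 2 feature types
KERNEL_TYPES = ["gaussian", "laplacian", "isd", "id"]  # 4 kernel types
KERNEL_PARAMS = [0.25, 0.5, 1.0, 2.0, 4.0]   # 5 hypothetical parameter values

# One base kernel per combination of level, feature, kernel type and parameter.
base_kernels = list(product(LEVELS, FEATURES, KERNEL_TYPES, KERNEL_PARAMS))

# Group the base kernels that feed each average classifier.
groups = {}
for level, feature, kernel_type, param in base_kernels:
    groups.setdefault((level, feature), []).append((kernel_type, param))


def average_classifier(base_scores):
    """Average the decision values of the base classifiers in one group."""
    return sum(base_scores) / len(base_scores)


print(len(base_kernels))                   # total number of base kernels
print({key: len(members) for key, members in groups.items()})
```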
Experiments
• Comparisons of cross-domain learning methods
– (a) SIFT features
– (b) ST features
– (c) SIFT features and ST features
– “parade”: 75.7% (A-MKL) vs. 62.2% (FR)
Experiments
• Comparisons of cross-domain learning methods
• Relative improvements of A-MKL over each method
– SVM_T: 36.9%
– SVM_AT: 8.6%
– Feature Replication (FR) [6]: 7.6%
– Adaptive SVM (A-SVM) [8]: 49.6%
– Domain Transfer SVM (DTSVM) [7]: 9.9%
• MKL-based methods
– Better fuse SIFT features and ST features
– Handle noise in the loose labels
Conclusion
• We propose a new event recognition framework for consumer videos by leveraging a large number of loosely labeled web videos.
• We develop a new aligned space-time pyramid matching method.
• We present a new cross-domain learning method A-MKL which handles the mismatch between the data distributions of the consumer video domain and the web video domain.
References
[1] D. Xu and S.-F. Chang. Video event recognition using kernel methods with multi-level temporal alignment. T-PAMI, 30(11):1985–1997, 2008.
[2] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
[3] Y. Rubner, C. Tomasi, and L. J. Guibas. The Earth mover’s distance as a metric for image retrieval. IJCV, 40(2):99–121, 2000.
[4] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. Smola. Integrating structured biological data by kernel maximum mean discrepancy. In ISMB, 2006.
[5] F. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality and the SMO algorithm. In ICML, 2004.
[6] H. Daumé III. Frustratingly easy domain adaptation. In ACL, 2007.
[7] L. Duan, I. W. Tsang, D. Xu, and S. J. Maybank. Domain transfer SVM for video concept detection. In CVPR, 2009.
[8] J. Yang, R. Yan, and A. G. Hauptmann. Cross-domain video concept detection using adaptive SVMs. In ACM MM, 2007.
[9] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
Thank you!