SUPER: Towards Real-time Event Recognition in Internet Videos
![Page 1: SUPER: Towards Real-time Event Recognition in Internet Videos](https://reader030.fdocuments.us/reader030/viewer/2022032607/568130fa550346895d9724f8/html5/thumbnails/1.jpg)
Yu-Gang Jiang
School of Computer Science, Fudan University, Shanghai, China
Speeded Up Event Recognition
ACM International Conference on Multimedia Retrieval (ICMR), Hong Kong, China, June 2012.
![Page 2: SUPER: Towards Real-time Event Recognition in Internet Videos](https://reader030.fdocuments.us/reader030/viewer/2022032607/568130fa550346895d9724f8/html5/thumbnails/2.jpg)
The Problem
• Recognize high-level events in videos
  We're particularly interested in Internet consumer videos
• Applications: video search, personal video collection management, smart advertising, intelligence analysis, …
![Page 3: SUPER: Towards Real-time Event Recognition in Internet Videos](https://reader030.fdocuments.us/reader030/viewer/2022032607/568130fa550346895d9724f8/html5/thumbnails/3.jpg)
Our Objective
Improve Efficiency
Maintain Accuracy
![Page 4: SUPER: Towards Real-time Event Recognition in Internet Videos](https://reader030.fdocuments.us/reader030/viewer/2022032607/568130fa550346895d9724f8/html5/thumbnails/4.jpg)
The Baseline Recognition Framework
[Figure: baseline pipeline]
• Feature extraction: SIFT, spatial-temporal interest points, MFCC audio feature
• Classifier: χ2 kernel SVM
• Fusion: late average fusion
Yu-Gang Jiang, Xiaohong Zeng, Guangnan Ye, Subh Bhattacharya, Dan Ellis, Mubarak Shah, Shih-Fu Chang, Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching, NIST TRECVID Workshop, 2010.
Best-performing approach in the TRECVID 2010 Multimedia Event Detection (MED) task.
![Page 5: SUPER: Towards Real-time Event Recognition in Internet Videos](https://reader030.fdocuments.us/reader030/viewer/2022032607/568130fa550346895d9724f8/html5/thumbnails/5.jpg)
Three Audio-Visual Features…
• SIFT (visual) – D. Lowe, IJCV '04
• STIP (visual) – I. Laptev, IJCV '05
• MFCC (audio) … [Figure: audio frames labeled 16 ms, 16 ms]
![Page 6: SUPER: Towards Real-time Event Recognition in Internet Videos](https://reader030.fdocuments.us/reader030/viewer/2022032607/568130fa550346895d9724f8/html5/thumbnails/6.jpg)
Bag-of-words Representation
• SIFT / STIP / MFCC words
• Soft weighting (Jiang, Ngo and Yang, ACM CIVR 2007)
[Figure: Bag-of-SIFT. Keypoint extraction (DoG, Hessian Affine); vocabulary generation in SIFT feature space (Vocabulary 1, Vocabulary 2); BoW representation: BoW histograms using soft-weighting]
Bag of audio words / bag of frames: K. Lee and D. Ellis, Audio-Based Semantic Concept Classification for Consumer Video, IEEE Trans. on Audio, Speech, and Language Processing, 2010.
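As a rough illustration of the soft-weighting idea on this slide, the sketch below builds a bag-of-words histogram in which each descriptor votes for its N nearest words, with the vote halved at each rank. The 1/2^(i-1) rank weight follows the scheme as it is usually described for the cited CIVR 2007 paper; the similarity measure 1/(1 + distance), the toy 2-D data, and the name `soft_weight_bow` are assumptions made for illustration, not the authors' implementation.

```python
import math

def soft_weight_bow(descriptors, vocabulary, top_n=4):
    """Soft-weighted bag-of-words histogram (illustrative sketch).

    Each descriptor votes for its top_n nearest words; the vote for the
    i-th nearest word (i = 1, 2, ...) is scaled by 1 / 2**(i - 1).
    """
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        # Distance from this descriptor to every word, nearest first.
        ranked = sorted(
            (math.dist(d, w), k) for k, w in enumerate(vocabulary)
        )
        for rank, (dist, k) in enumerate(ranked[:top_n]):
            similarity = 1.0 / (1.0 + dist)  # assumed similarity measure
            hist[k] += similarity / (2 ** rank)
    return hist

# Toy 2-D "descriptors" and a 3-word vocabulary (hypothetical data).
vocab = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0)]
descs = [(0.0, 0.0), (1.0, 0.0)]
hist = soft_weight_bow(descs, vocab, top_n=2)
```

With `top_n=1` this reduces to ordinary hard assignment scaled by the similarity, which makes it easy to compare the two schemes on the same vocabulary.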
![Page 7: SUPER: Towards Real-time Event Recognition in Internet Videos](https://reader030.fdocuments.us/reader030/viewer/2022032607/568130fa550346895d9724f8/html5/thumbnails/7.jpg)
Baseline Speed…
• 4 factors on speed: feature, classifier, fusion, frame sampling
[Figure: baseline pipeline annotated with time per stage, in seconds]
• SIFT: 82.0
• Spatial-temporal interest points: 916.8
• MFCC audio feature: 2.36
• χ2 kernel SVM classifier: ~2.00
• Late average fusion: <<1
Feature extraction time is measured in seconds needed to process an 80-second video sequence (SIFT is computed at 0.5 fps).
Classification time is measured by classifying one video with the classifiers of all 20 categories.
Total: 1003 seconds per video!
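The total can be checked with a few lines of arithmetic. The sketch below sums the per-stage times printed on the slide and compares the result with the 80-second video length; assigning each number to a stage by the order of the pipeline diagram is an assumption, as are the variable names.

```python
# Per-stage times in seconds for one 80-second video, as read from the slide.
# The stage labels follow the order of the pipeline diagram (an assumption).
stage_seconds = {
    "SIFT": 82.0,
    "STIP": 916.8,
    "MFCC": 2.36,
    "classification (approx.)": 2.0,
}
video_seconds = 80.0

total = sum(stage_seconds.values())
slowdown = total / video_seconds          # processing time / video duration
stip_share = stage_seconds["STIP"] / total

print(f"total: {total:.1f} s")
print(f"slowdown vs. real time: {slowdown:.1f}x")
print(f"STIP share of total: {stip_share:.0%}")
```

Under this reading the stages add up to roughly 1003 seconds, more than twelve times the video duration, with spatial-temporal interest points accounting for most of the cost.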
![Page 8: SUPER: Towards Real-time Event Recognition in Internet Videos](https://reader030.fdocuments.us/reader030/viewer/2022032607/568130fa550346895d9724f8/html5/thumbnails/8.jpg)
Dataset: Columbia Consumer Videos (CCV)
20 categories: Basketball, Baseball, Soccer, Ice Skating, Skiing, Swimming, Biking, Cat, Dog, Bird, Graduation, Birthday Celebration, Wedding Reception, Wedding Ceremony, Wedding Dance, Music Performance, Non-music Performance, Parade, Beach, Playground
Yu-Gang Jiang, Guangnan Ye, Shih-Fu Chang, Daniel Ellis, Alexander C. Loui, Consumer Video Understanding: A Benchmark Database and An Evaluation of Human and Machine Performance, in ACM ICMR 2011.
![Page 9: SUPER: Towards Real-time Event Recognition in Internet Videos](https://reader030.fdocuments.us/reader030/viewer/2022032607/568130fa550346895d9724f8/html5/thumbnails/9.jpg)
Feature Options
• (Sparse) SIFT
• STIP
• MFCC
• Dense SIFT (DIFT)
• Dense SURF (DURF)
• Self-Similarities (SSIM)
• Color Moments (CM)
• GIST
• LBP
• TINY
Uijlings, Smeulders, Scha, Real-time bag of words, approximately, in ACM CIVR 2009.
Suggested feature combinations:
![Page 10: SUPER: Towards Real-time Event Recognition in Internet Videos](https://reader030.fdocuments.us/reader030/viewer/2022032607/568130fa550346895d9724f8/html5/thumbnails/10.jpg)
Classifier Kernels
• Chi Square Kernel
• Histogram Intersection Kernel (HI)
• Fast HI Kernel (fastHI)
Maji, Berg, Malik, Classification Using Intersection Kernel Support Vector Machines is Efficient, in CVPR 2008.
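To make the difference between these kernels concrete, the sketch below defines a chi-square kernel and a histogram intersection kernel, then evaluates an intersection-kernel SVM score in two ways: naively, with one kernel call per support vector, and with the per-dimension decomposition described by Maji et al., using sorted support values, prefix sums, and binary search. The exponential form of the chi-square kernel, the `gamma` parameter, the class name `FastHI`, and the toy numbers are assumptions; the slides do not say whether fastHI refers to this exact variant or to the approximate lookup-table variant.

```python
import bisect
import math

def chi2_kernel(x, y, gamma=1.0):
    """Exponential chi-square kernel (one common form; gamma is assumed)."""
    d = sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)
    return math.exp(-gamma * d)

def hi_kernel(x, y):
    """Histogram intersection kernel: sum of element-wise minima."""
    return sum(min(a, b) for a, b in zip(x, y))

def naive_hi_decision(x, support, coef, bias=0.0):
    """SVM score with one kernel evaluation per support vector."""
    return sum(c * hi_kernel(x, s) for s, c in zip(support, coef)) + bias

class FastHI:
    """Exact per-dimension evaluation with sorted values and prefix sums."""

    def __init__(self, support, coef, bias=0.0):
        # coef[l] stands for alpha_l * y_l of support vector l.
        self.bias = bias
        self.dims = []
        for i in range(len(support[0])):
            pairs = sorted((s[i], c) for s, c in zip(support, coef))
            values = [v for v, _ in pairs]
            # below[r]: sum of c * v over the r smallest values
            # above[r]: sum of c over the remaining values
            below, above = [0.0], [sum(c for _, c in pairs)]
            for v, c in pairs:
                below.append(below[-1] + c * v)
                above.append(above[-1] - c)
            self.dims.append((values, below, above))

    def decision(self, x):
        score = self.bias
        for xi, (values, below, above) in zip(x, self.dims):
            r = bisect.bisect_left(values, xi)  # number of support values < xi
            # min(xi, v) is v when v < xi and xi otherwise
            score += below[r] + xi * above[r]
        return score

# Toy model with three 2-D support vectors (hypothetical numbers).
support = [[0.2, 0.8], [0.5, 0.5], [0.9, 0.1]]
coef = [1.0, -0.5, 0.25]
model = FastHI(support, coef, bias=0.1)
x = [0.4, 0.6]
fast_score = model.decision(x)
slow_score = naive_hi_decision(x, support, coef, bias=0.1)
```

The naive score costs time proportional to the number of support vectors times the dimension, while the decomposed version needs only one binary search per dimension at test time, which is where the speed-up comes from.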
![Page 11: SUPER: Towards Real-time Event Recognition in Internet Videos](https://reader030.fdocuments.us/reader030/viewer/2022032607/568130fa550346895d9724f8/html5/thumbnails/11.jpg)
Multi-modality Fusion
• Early Fusion: feature concatenation
• Kernel Fusion: Kf = K1 + K2 + …
• Late Fusion: fusion of classification scores
MFCC, DURF, SSIM, CM, GIST, LBP
MFCC, DURF
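The three fusion strategies listed on this slide can be sketched in a few lines. The function names and toy inputs below are hypothetical, and averaging is used for late fusion only because the baseline is described as late average fusion; other score combinations are possible.

```python
def early_fusion(feature_vectors):
    """Concatenate per-modality feature vectors into one vector."""
    fused = []
    for vec in feature_vectors:
        fused.extend(vec)
    return fused

def kernel_fusion(kernel_matrices):
    """Element-wise sum of kernel matrices: Kf = K1 + K2 + ..."""
    n = len(kernel_matrices[0])
    return [
        [sum(K[i][j] for K in kernel_matrices) for j in range(n)]
        for i in range(n)
    ]

def late_fusion(scores):
    """Average the classification scores of per-modality classifiers."""
    return sum(scores) / len(scores)

# Toy inputs for two modalities (hypothetical numbers).
mfcc_hist = [0.1, 0.9]
durf_hist = [0.3, 0.3, 0.4]
fused_vector = early_fusion([mfcc_hist, durf_hist])

K_mfcc = [[1.0, 0.2], [0.2, 1.0]]
K_durf = [[1.0, 0.5], [0.5, 1.0]]
fused_kernel = kernel_fusion([K_mfcc, K_durf])

fused_score = late_fusion([0.8, 0.4])
```

Early and kernel fusion both train a single classifier, whereas late fusion trains one classifier per feature and combines their outputs, so the choice affects classification cost as well as accuracy.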
![Page 12: SUPER: Towards Real-time Event Recognition in Internet Videos](https://reader030.fdocuments.us/reader030/viewer/2022032607/568130fa550346895d9724f8/html5/thumbnails/12.jpg)
Frame Sampling
• DURF: uniformly sampling 16 frames per video seems sufficient.
K. Schindler and L. van Gool, Action snippets: How many frames does human action recognition require?, in CVPR 2008.
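One simple way to sample 16 frames uniformly is to split the video into 16 equal segments and take the frame at the centre of each. The slide does not say how the sampling was done, so the segment-centre rule, the function name `uniform_frame_indices`, and the 30 fps example are assumptions.

```python
def uniform_frame_indices(num_frames, num_samples=16):
    """Pick num_samples frame indices spread evenly over a video.

    Each index is the centre of one of num_samples equal-length segments.
    If the video has fewer frames than requested, every frame is kept.
    """
    if num_frames <= num_samples:
        return list(range(num_frames))
    step = num_frames / num_samples
    return [int(step * (i + 0.5)) for i in range(num_samples)]

# Hypothetical example: an 80-second clip at 30 fps has 2400 frames.
indices = uniform_frame_indices(80 * 30)
```

Dense features then only need to be extracted on the returned frames, so the visual feature cost no longer grows with video length.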
![Page 13: SUPER: Towards Real-time Event Recognition in Internet Videos](https://reader030.fdocuments.us/reader030/viewer/2022032607/568130fa550346895d9724f8/html5/thumbnails/13.jpg)
Frame Sampling
• MFCC: sampling audio frames is always harmful.
![Page 14: SUPER: Towards Real-time Event Recognition in Internet Videos](https://reader030.fdocuments.us/reader030/viewer/2022032607/568130fa550346895d9724f8/html5/thumbnails/14.jpg)
Summary
• Feature: Dense SURF (DURF), MFCC, plus some global features
• Classifier: Fast HI kernel SVM
• Fusion: Early
• Frame Selection: Audio - No; Visual - Yes
220-fold speed-up!
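A quick back-of-the-envelope check shows what a 220-fold speed-up would mean in practice. The sketch assumes the factor applies to the 1003-second baseline total for an 80-second video reported earlier in the talk; the slide does not state this explicitly, so the resulting figures are estimates.

```python
baseline_seconds = 1003.0   # baseline total per video, from the earlier slide
video_seconds = 80.0        # length of the benchmark video sequence
speedup = 220.0             # speed-up factor stated on this slide

# Assumption: the 220-fold factor applies to the 1003-second baseline total.
fast_seconds = baseline_seconds / speedup
realtime_ratio = video_seconds / fast_seconds  # > 1 means faster than real time

print(f"estimated time per video: {fast_seconds:.2f} s")
print(f"video duration / processing time: {realtime_ratio:.1f}")
```

Under that assumption an 80-second video would take under five seconds to process, which is well within real time.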
![Page 15: SUPER: Towards Real-time Event Recognition in Internet Videos](https://reader030.fdocuments.us/reader030/viewer/2022032607/568130fa550346895d9724f8/html5/thumbnails/15.jpg)
Demo…