UCF VIRAT Efforts


Page 1: UCF VIRAT Efforts

Bag of Video-Words Video Representation

UCF VIRAT EFFORTS

Page 2: UCF VIRAT Efforts

Outline

- Bag of Video-words approach for video representation
  - Feature detection
  - Feature quantization
  - Histogram-based video descriptor generation
- Preliminary experimental results on aerial videos
- Discussion on ways to improve the performance

Page 3: UCF VIRAT Efforts

Bag of video-words approach (I)

Interest Point Detector

Motion Feature Detection

Page 4: UCF VIRAT Efforts

Bag of video-words approach (II)

Video-word A

Video-word B

Video-word C

Feature Quantization: Codebook Generation
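A minimal sketch of this quantization step, assuming scikit-learn and a pooled (N, D) array of spatiotemporal descriptors from the training clips (the names `features` and `VOCAB_SIZE` are illustrative, not from the slides):

```python
# Codebook generation by k-means; each cluster center is one video-word.
import numpy as np
from sklearn.cluster import KMeans

VOCAB_SIZE = 200  # e.g. the "Codebook 200" setting used in the experiments

def build_codebook(features: np.ndarray, k: int = VOCAB_SIZE) -> KMeans:
    """Cluster the pooled descriptors into k video-words."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
```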

Page 5: UCF VIRAT Efforts

Bag of video-words approach (III)

Histogram-based video descriptor generation
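Continuing the sketch above (same scikit-learn `codebook`), a clip's descriptor is the normalized histogram of its features' nearest video-words:

```python
import numpy as np

def video_descriptor(clip_features: np.ndarray, codebook) -> np.ndarray:
    """Quantize a clip's features against the codebook and histogram them."""
    words = codebook.predict(clip_features)  # nearest video-word per feature
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)       # L1-normalize the histogram
```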


Page 6: UCF VIRAT Efforts

Similarity Metrics

- Histogram intersection:
  $\mathrm{Sim}(H_a, H_b) = \sum_i \min(H_a(i), H_b(i))$
- Chi-square distance:
  $\mathrm{Sim}(H_a, H_b) = \exp(-\chi^2(H_a, H_b)) = \exp\!\left(-\sum_i \frac{(H_a(i) - H_b(i))^2}{H_a(i) + H_b(i)}\right)$
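In code, the two metrics above are a few lines of NumPy (a sketch; `ha` and `hb` stand for the normalized histograms H_a and H_b):

```python
import numpy as np

def hist_intersection(ha: np.ndarray, hb: np.ndarray) -> float:
    """Sim(Ha, Hb) = sum_i min(Ha(i), Hb(i))."""
    return float(np.minimum(ha, hb).sum())

def chi2_similarity(ha: np.ndarray, hb: np.ndarray, eps: float = 1e-10) -> float:
    """Sim(Ha, Hb) = exp(-chi2), chi2 = sum_i (Ha(i)-Hb(i))^2 / (Ha(i)+Hb(i))."""
    chi2 = np.sum((ha - hb) ** 2 / (ha + hb + eps))  # eps guards empty bins
    return float(np.exp(-chi2))
```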

Page 7: UCF VIRAT Efforts

Classifiers

- Bayesian Classifier
- K-Nearest Neighbors (KNN)
- Support Vector Machines (SVM), with kernels (see the sketch below):
  - Histogram Intersection Kernel
  - Chi-square Kernel
  - RBF (Radial Basis Function) Kernel
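A hedged sketch of the SVM with a histogram-intersection kernel: precompute the Gram matrix and pass it to scikit-learn's SVC. `X_train`, `X_test`, and `y_train` are assumed to hold stacked clip histograms and action labels; none of these names come from the slides.

```python
import numpy as np
from sklearn.svm import SVC

def intersection_gram(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Gram matrix K[i, j] = sum_k min(A[i, k], B[j, k])."""
    return np.array([[np.minimum(a, b).sum() for b in B] for a in A])

# clf = SVC(kernel="precomputed").fit(intersection_gram(X_train, X_train), y_train)
# pred = clf.predict(intersection_gram(X_test, X_train))
```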

Page 8: UCF VIRAT Efforts

Experiments on Aerial Videos

- Dataset: blimp with an HD camera on a gimbal
- 11 actions: digging, gesturing, picking up, throwing, kicking, carrying object, walking, standing, running, entering vehicle, exiting vehicle

Page 9: UCF VIRAT Efforts

Clipping & Cropping Actions

- Optimal box is created so that the object of interest doesn't go out of view in all the frames (Start Frame to End Frame)

Start of Frame

End of Frame
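A minimal sketch of that "optimal box": the union of the per-frame boxes between the start and end frames guarantees the object never leaves the crop. The (x1, y1, x2, y2) box format is an assumption, not from the slides.

```python
def optimal_box(per_frame_boxes):
    """Union of per-frame boxes (x1, y1, x2, y2) over the whole clip."""
    x1 = min(b[0] for b in per_frame_boxes)
    y1 = min(b[1] for b in per_frame_boxes)
    x2 = max(b[2] for b in per_frame_boxes)
    y2 = max(b[3] for b in per_frame_boxes)
    return (x1, y1, x2, y2)
```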

Page 10: UCF VIRAT Efforts

Feature Detection for Video Clips

[Figure: 200 detected features per clip on examples of digging, kicking, throwing, and walking]

Page 11: UCF VIRAT Efforts

Classification Results (I): “kicking” (22 clips) vs. “non-kicking” (22 clips)

Features per video   Codebook 50   Codebook 100   Codebook 200
50                   65.91%        79.55%         75.00%
100                  79.55%        77.27%         77.27%
200                  77.27%        79.55%         81.82%

Page 12: UCF VIRAT Efforts

Classification Results (II)

Page 13: UCF VIRAT Efforts

Classification Results (III): “Digging”, “Kicking”, “Walking”, “Throwing” (25 clips × 4)

[Figure: similarity matrix (histogram intersection) over digging, kicking, throwing, and walking]

Page 14: UCF VIRAT Efforts

Classification Results (IV): average accuracy with different codebook sizes

Features per video   Codebook 100   Codebook 200   Codebook 300
200                  84.6%          85.0%          86.7%

[Figure: confusion table for a codebook size of 300]

Page 15: UCF VIRAT Efforts

Misclassified examples (I): “walking” misclassified as “kicking”

Page 16: UCF VIRAT Efforts

Misclassified examples (II): “digging” misclassified as “walking”

Page 17: UCF VIRAT Efforts

Misclassified examples (III): “walking” misclassified as “throwing”

Page 18: UCF VIRAT Efforts

How to improve the performance?

- Low-level features
  - Stable motion features
  - Different motion features
  - Different motion feature sampling
  - Hybrid of motion and static features
- Video-words generation
  - Unsupervised method: Hierarchical K-Means (David Nister et al., CVPR 2006)
  - Supervised methods: Random Forests (Bill Triggs et al., NIPS 2007); “Visual Bits” (Rong Jin et al., CVPR 2008)
- Classifiers
  - SVM kernels: histogram intersection vs. chi-square distance
  - Multiple kernels

Page 19: UCF VIRAT Efforts

Stable motion features

- Motion compensation
- Video clipping and cropping

[Figure: start and end frames of a stabilized, cropped clip]

Page 20: UCF VIRAT Efforts

Different Low-level Features

- Flattened gradient vector (magnitude)
- Histogram of Gradient (direction), sketched below
- Histogram of Optical Flow
- Combination of all feature types
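An illustrative sketch of the gradient-direction feature: a HoG-style orientation histogram over a 2D patch, weighted by gradient magnitude. The bin count and weighting scheme are assumptions, not from the slides.

```python
import numpy as np

def gradient_orientation_histogram(patch: np.ndarray, n_bins: int = 8) -> np.ndarray:
    gy, gx = np.gradient(patch.astype(float))   # image gradients
    mag = np.hypot(gx, gy)                      # gradient magnitude
    ang = np.arctan2(gy, gx) % (2 * np.pi)      # direction in [0, 2*pi)
    hist, _ = np.histogram(ang, bins=n_bins, range=(0, 2 * np.pi), weights=mag)
    return hist / max(hist.sum(), 1.0)
```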

Page 21: UCF VIRAT Efforts

Feature sampling

- Feature detection: Gabor filter or 3D Harris corner detection
- Random sampling
- Grid-based sampling (see the sketch below)
- Bill Triggs et al., Sampling Strategies for Bag-of-Features Image Classification, ECCV 2006
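A sketch of grid-based sampling: emit spatiotemporal cuboids at fixed strides instead of running an interest-point detector. All sizes below are illustrative.

```python
def grid_cuboids(n_frames, height, width,
                 size=16, t_size=10, stride=16, t_stride=10):
    """Yield cuboid origins and extents on a regular space-time grid."""
    for t in range(0, n_frames - t_size + 1, t_stride):
        for y in range(0, height - size + 1, stride):
            for x in range(0, width - size + 1, stride):
                yield (t, y, x, t_size, size, size)  # origin + extents
```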

Page 22: UCF VIRAT Efforts

Hybrid of Motion and Static Features (I)

- Multiple-frame features (spatiotemporal, motion)
  - 3D Harris detector
  - Captures the local spatiotemporal information around the interest points
- Single-frame features (spatial, static)
  - 2D Harris detector
  - MSER (Maximally Stable Extremal Regions) detector
  - Performs action recognition from a sequence of instantaneous postures or poses
  - Overcomes a shortcoming of multiple-frame features, which require relatively stable camera motion
- Hybrid of motion and static features
  - Represents a video by the combination of multiple-frame and single-frame features (see the sketch below)
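A minimal sketch of the hybrid representation, assuming each stream has already been quantized against its own codebook: concatenate the two normalized histograms (the per-stream weight `w` is an assumption, not from the slides).

```python
import numpy as np

def hybrid_descriptor(motion_hist: np.ndarray, static_hist: np.ndarray,
                      w: float = 0.5) -> np.ndarray:
    """Concatenate the two normalized per-stream histograms into one descriptor."""
    return np.concatenate([w * motion_hist, (1.0 - w) * static_hist])
```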

Page 23: UCF VIRAT Efforts

Hybrid of Motion and Static Features (II): examples of 2D Harris and MSER features

[Figure: example detections from the 2D Harris and MSER detectors]

Page 24: UCF VIRAT Efforts

Hybrid of Motion and Static Features (III): experiments on three action datasets

- KTH: 6 action categories, 600 videos
- UCF Sports: 10 action categories, about 200 videos
- YouTube videos: 11 action categories, about 1,100 videos

Page 25: UCF VIRAT Efforts

KTH dataset

Boxing Clapping Waving

Walking Jogging Running

Page 26: UCF VIRAT Efforts

Experimental results on KTH dataset: recognition using Motion (left), Static (middle), and Hybrid (right) features

[Figure: three confusion matrices; average accuracies of 92.66%, 82.96%, and 87.65%]

Page 27: UCF VIRAT Efforts

Results on UCF sports dataset

The average accuracies for the static, motion, and static+motion strategies are 74.5%, 79.6%, and 84.5%, respectively.

Page 28: UCF VIRAT Efforts

YouTube Video Dataset (I)

Cycling Diving Golf Swinging

Riding Juggling

Page 29: UCF VIRAT Efforts

YouTube Video Dataset (II)

Basketball Shooting Swinging Tennis Swinging

Volleyball Spiking Trampoline Jumping

Page 30: UCF VIRAT Efforts

Results on YouTube dataset

The average accuracies for motion, static, and hybrid features are 65.4%, 63.1%, and 71.2%, respectively.

Page 31: UCF VIRAT Efforts

Hierarchical K-Means (I)

- Traditional k-means
  - Slow when generating a large codebook
  - Less discriminative when dealing with a large codebook
- Hierarchical k-means (see the sketch below)
  - Builds a tree on the training features
  - Child nodes are the clusters obtained by applying k-means to the parent node
  - Each node is treated as a “word”, so the tree is a hierarchical codebook
- D. Nister, Scalable Recognition with a Vocabulary Tree, CVPR 2006
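A minimal vocabulary-tree sketch, assuming scikit-learn: recursively apply k-means with a small branching factor, so every node (not just the leaves) acts as a “word”. The branching factor and depth are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_tree(features: np.ndarray, branch: int = 10, depth: int = 3) -> dict:
    """Recursive k-means: each node stores its center and child clusters."""
    node = {"center": features.mean(axis=0), "children": []}
    if depth == 0 or len(features) < branch:
        return node
    km = KMeans(n_clusters=branch, n_init=4, random_state=0).fit(features)
    node["children"] = [build_tree(features[km.labels_ == c], branch, depth - 1)
                        for c in range(branch)]
    return node
```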

Page 32: UCF VIRAT Efforts

Hierarchical K-Means (II): Advantages

- The tree also defines the quantization of features, so indexing and quantization are integrated in one tree
- It is much more efficient when generating a large codebook
- The word (node) frequency can be weighted by the inverse document frequency (see the sketch below)
- It can generate more discriminative words than flat k-means
- A large codebook generally yields better performance
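The inverse-document-frequency weighting mentioned above can be sketched as follows, assuming an (n_videos, n_words) matrix of raw word counts; the exact weighting scheme is an assumption, not specified on the slides.

```python
import numpy as np

def idf_weight(word_counts: np.ndarray) -> np.ndarray:
    """word_counts: (n_videos, n_words) matrix of raw word frequencies."""
    df = (word_counts > 0).sum(axis=0)                 # document frequency
    idf = np.log(len(word_counts) / np.maximum(df, 1))
    return word_counts * idf                           # tf-idf style weighting
```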

Page 33: UCF VIRAT Efforts

Random Forests (I)

- K-means-based quantization methods
  - Unsupervised
  - Suffer from the high dimensionality of the features
- Single-tree-based methods
  - Each path through the tree typically accesses only a few of the feature dimensions
  - Fail to deal with the variance of the feature dimensions
  - Fast, but performance is not even as good as k-means
- Random forests (see the sketch below)
  - Build an ensemble of trees
  - Each tree node is split by checking a randomly selected subset of the feature dimensions
  - All trees are built using video or image labels (a supervised method)
  - Instead of treating the trees as an ensemble of classifiers, we treat all the leaves of all the trees as “words”
  - The generated “words” are more meaningful and discriminative, since they carry class category information
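A hedged sketch of this idea with scikit-learn's RandomForestClassifier: `apply()` returns the leaf each sample reaches in every tree, and the histogram over all (tree, leaf) pairs serves as the clip descriptor. Tree count and depth are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def forest_words(train_feats, train_labels, clip_feats,
                 n_trees=10, max_depth=8):
    rf = RandomForestClassifier(n_estimators=n_trees, max_depth=max_depth,
                                random_state=0).fit(train_feats, train_labels)
    leaves = rf.apply(clip_feats)                      # (n_samples, n_trees)
    counts = [t.tree_.node_count for t in rf.estimators_]
    offsets = np.concatenate([[0], np.cumsum(counts[:-1])])
    hist = np.zeros(sum(counts))
    for sample in leaves:
        for t, leaf in enumerate(sample):
            hist[offsets[t] + leaf] += 1               # one "word" per tree
    return hist / max(hist.sum(), 1.0)
```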

Page 34: UCF VIRAT Efforts

Random Forests (II)

Page 35: UCF VIRAT Efforts

“Visual Bits” (I)

- Both k-means and random forests
  - Treat all features equally when generating the codebooks
  - Use hard assignment (each feature can only be assigned to one “word”)
- “Visual Bits”
  - Rong Jin et al., Unifying Discriminative Visual Codebook Generation with Classifier Training for Object Category Recognition, CVPR 2008
  - Trains a visual codebook for each category, so it can overcome the shortcomings of “hard assignment” of the features
  - Integrates classification and codebook generation, so it can select the relevant features by weighting them

Page 36: UCF VIRAT Efforts

“Visual Bits” (II)

Page 37: UCF VIRAT Efforts

Classifiers

- Kernel SVM
  - Histogram intersection
  - Chi-square distance
- Multiple kernels (a simple combination is sketched below)
  - Fuse different types of features
  - Fuse different distance metrics
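One simple multiple-kernel scheme (an assumption; the slides do not specify the combination rule) is a convex sum of precomputed Gram matrices, one per feature type or distance metric, which can then be fed to an SVM with a precomputed kernel:

```python
def combine_kernels(grams, weights):
    """Convex sum of precomputed Gram matrices (one per feature type/metric)."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * K for w, K in zip(weights, grams))
```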

Page 38: UCF VIRAT Efforts

The end… Thank you!