CVPR 2013 Poster: Representing Videos using Mid-level Discriminative Patches

  • Slide 1
  • CVPR 2013 Poster: Representing Videos using Mid-level Discriminative Patches
  • Slide 2
  • Outline: Introduction; Mining Discriminative Patches; Analyzing Videos; Experimental Evaluation & Conclusion
  • Slide 3
  • 1. Introduction. Q1: What does it mean to understand this video? Q2: How might we achieve such an understanding?
  • Slide 4
  • 1. Introduction. Prior work represents a video either as a single feature vector or in terms of semantic entities (actions, objects, and their bits and pieces). The general framework detects objects and primitive actions and then uses Bayesian networks to infer a storyline.
  • Slide 5
  • 1. Introduction. Drawback: computational models for identifying semantic entities are not robust enough to serve as a basis for video analysis.
  • Slide 6
  • 1. Introduction. We represent a video using discriminative spatio-temporal patches rather than a global feature vector or a set of semantic entities. These patches may correspond to a primitive human action, a semantic object, a human-object pair, or simply random but informative patches, and they are automatically mined from training data consisting of hundreds of videos.
  • Slide 7
  • 1. Introduction. The spatio-temporal patches act as a discriminative vocabulary for action classification and establish strong correspondences between patches in training and test videos. Using label-transfer techniques, we align the videos and perform tasks such as object localization and finer-level action detection.
  • Slide 8
  • 1. Introduction
  • Slide 9
  • Slide 10
  • 2. Mining Discriminative Patches. Two conditions: (1) they occur frequently within a class; (2) they are distinct from patches in other classes. Challenges: (1) the space of potential spatio-temporal patches is extremely large, given that patches can occur over a range of scales; (2) the overwhelming majority of video patches are uninteresting.
  • Slide 11
  • 2. Mining Discriminative Patches. Paradigm: bag-of-words. Step 1: sample a few thousand patches and perform k-means clustering to find representative clusters. Step 2: rank these clusters based on membership in different action classes. Major drawbacks: (1) the high-dimensional distance metric; (2) partitioning. A sketch of this baseline appears below.
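The slide above describes the standard bag-of-words mining baseline. Below is a minimal sketch of that baseline, using synthetic data in place of real HOG3D patch descriptors; the cluster count, descriptor dimensionality, and purity-based ranking are illustrative assumptions, not settings from the poster.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_classes = 10
descriptors = rng.normal(size=(3000, 1600))        # sampled patch descriptors (synthetic)
labels = rng.integers(0, n_classes, size=3000)     # action class of each patch's source video

# Step 1: k-means clustering to find representative clusters.
kmeans = KMeans(n_clusters=100, n_init=4, random_state=0).fit(descriptors)

# Step 2: rank clusters by how strongly their membership concentrates in one class.
cluster_purity = []
for c in range(kmeans.n_clusters):
    member_labels = labels[kmeans.labels_ == c]
    if member_labels.size == 0:
        cluster_purity.append(0.0)
        continue
    counts = np.bincount(member_labels, minlength=n_classes)
    cluster_purity.append(counts.max() / counts.sum())

ranked = np.argsort(cluster_purity)[::-1]
print("most class-specific clusters:", ranked[:10])
```

The two drawbacks named on the slide show up directly here: the Euclidean metric inside k-means and the forced assignment of every patch to some cluster.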
  • Slide 12
  • 2. Mining Discriminative Patches. (1) High-dimensional distance metric: k-means uses a standard distance metric (e.g., Euclidean or normalized cross-correlation), which does not work well in high-dimensional spaces such as that of the HOG3D descriptors we use.
  • Slide 13
  • 2. Mining Discriminative Patches. (2) Partitioning: standard clustering algorithms partition the entire feature space, so every data point is assigned to one of the clusters during the clustering procedure. However, in many cases, assigning cluster memberships to rare background patches is hard, and due to the forced clustering they significantly diminish the purity of the good clusters to which they are assigned.
  • Slide 14
  • 2. Mining Discriminative Patches. To resolve these issues we learn Exemplar-SVMs (e-SVMs): (1) use an exemplar-based clustering approach; (2) consider every patch as a possible cluster center. Drawback: this is computationally infeasible. Resolution: prune patches using motion and use a nearest-neighbor step for an initial ranking.
  • Slide 15
  • 2. Mining Discriminative Patches. The training videos are split into a training partition and a validation partition. Training partition (forms the clusters): (a) use a simple nearest-neighbor approach (typically k=20); (b) score and rank each patch; (c) select a few patches per action class and train an e-SVM for each; (d) the e-SVMs are used to form clusters. Validation partition: re-rank the clusters based on representativeness. A sketch of this pipeline follows.
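Below is a minimal sketch of the mining pipeline listed above, on synthetic descriptors. The k=20 neighbor scoring, the number of selected patches per class, and the e-SVM training setup (one positive exemplar against negatives from other classes) are assumptions made for illustration, not the poster's exact implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
n_classes, dim = 10, 1600
patches = rng.normal(size=(2000, dim))             # patches from the training partition (synthetic)
patch_class = rng.integers(0, n_classes, 2000)     # action class of each patch's source video

# (a)+(b) Cheap nearest-neighbor scoring: a patch is promising if most of its
# k nearest neighbors come from videos of the same action class.
k = 20
nn = NearestNeighbors(n_neighbors=k + 1).fit(patches)
_, idx = nn.kneighbors(patches)                    # idx[:, 0] is the patch itself
nn_score = (patch_class[idx[:, 1:]] == patch_class[:, None]).mean(axis=1)

# (c) Select a few top-scoring patches per class and train one e-SVM each:
# the selected patch is the single positive, patches from other classes are negatives.
esvms = []
for c in range(n_classes):
    candidates = np.flatnonzero(patch_class == c)
    top = candidates[np.argsort(-nn_score[candidates])[:3]]
    negatives = patches[patch_class != c]
    for p in top:
        X = np.vstack([patches[p][None, :], negatives])
        y = np.r_[1, np.zeros(len(negatives))]
        esvms.append(LinearSVC(C=0.1, class_weight="balanced", max_iter=5000).fit(X, y))

# (d) Each e-SVM defines a cluster via its top detections; the clusters are then
# re-ranked on the validation partition.
```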
  • Slide 16
  • 2. Mining Discriminative Patches. Goal: a smaller dictionary (a set of representative patches). Criteria: (a) appearance consistency, measured by a consistency score; (b) purity, measured by a tf-idf-style score comparing firings on the same class versus different classes. All patches are ranked using a linear combination of the two scores (see the sketch below).
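The slide names the two ranking criteria but not their exact formulas. The sketch below assumes a simple ratio of same-class to different-class firings for the tf-idf-style purity score and a mixing weight alpha; both are illustrative, since the poster only states that a linear combination of the two scores is used.

```python
import numpy as np

def rank_patches(appearance_consistency, same_class_firings, diff_class_firings, alpha=0.5):
    """Rank patches by alpha * consistency + (1 - alpha) * purity (both assumed forms)."""
    consistency = np.asarray(appearance_consistency, dtype=float)
    # tf-idf-like purity: firings on the same class relative to firings on other classes
    purity = np.asarray(same_class_firings, dtype=float) / (
        1.0 + np.asarray(diff_class_firings, dtype=float))
    combined = alpha * consistency + (1.0 - alpha) * purity
    return np.argsort(-combined)                   # best patch first

# Example with three candidate patches.
print(rank_patches([0.9, 0.4, 0.7], [12, 30, 8], [3, 40, 1]))
```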
  • Slide 17
  • 2. Mining Discriminative Patches
  • Slide 18
  • 3. Analyzing Videos. Action classification: the top-n e-SVM detectors are run on a test video, their detection scores form a feature vector, and an SVM classifier outputs the action class (a sketch follows below). Beyond classification, explanation via discriminative patches: Q: How can we use detections of discriminative patches to establish correspondences between training and test videos? Q: Which detections should be selected for establishing correspondence?
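A minimal sketch of the classification step, assuming each video is encoded by the maximum response of every e-SVM detector over its spatio-temporal locations and that a linear multi-class SVM is trained on these vectors; the max-pooling encoding and the linear kernel are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
n_train, n_test, n_detectors, n_classes = 200, 50, 300, 10

def encode(detector_responses):
    # detector_responses: (n_locations, n_detectors) e-SVM scores for one video
    return detector_responses.max(axis=0)          # video-level feature vector

# Synthetic stand-ins for per-video e-SVM response maps.
train_feats = np.stack([encode(rng.normal(size=(80, n_detectors))) for _ in range(n_train)])
test_feats = np.stack([encode(rng.normal(size=(80, n_detectors))) for _ in range(n_test)])
train_labels = rng.integers(0, n_classes, n_train)

clf = LinearSVC(C=1.0, max_iter=5000).fit(train_feats, train_labels)
print(clf.predict(test_feats)[:10])                # predicted action classes for test videos
```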
  • Slide 19
  • 3. Analyzing Videos. Context-dependent patch selection. Vocabulary size: N; candidate detections: {D_1, D_2, ..., D_N}; x_i indicates whether or not the detection of e-SVM i is selected. Appearance term (A_i): the e-SVM score for patch i. Class consistency term (C_li): promotes selection of certain e-SVMs over others given the action class l; for example, for the weightlifting class it prefers selection of patches with a man and a bar with vertical motion. We learn C_l from the training data by counting the number of times each e-SVM fires for each class.
  • Slide 20
  • 3. Analyzing Videos. Optimization: the resulting integer program is NP-hard, so we use the IPFP algorithm, which converges in 5-10 iterations. Penalty term (P_ij): the penalty for selecting a pair of detections together; it is high when (1) e-SVMs i and j do not fire frequently together in the training data, or (2) e-SVMs i and j are trained from different action classes. A sketch of the selection objective and an IPFP-style solver follows.
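The poster does not spell out the full objective, so the sketch below assumes the selection maximizes the unary terms (A_i + C_li) minus the pairwise penalties P_ij over binary indicators x_i, and uses an IPFP-style iteration (linearize, project to a binary vector, line search) to solve it; treat it as an illustration of the idea, not the authors' exact solver.

```python
import numpy as np

def ipfp_select(s, P, n_iter=10):
    """Approximately maximize f(x) = s.x - x.P.x over x in {0,1}^N (IPFP-style)."""
    N = len(s)
    x = np.full(N, 0.5)                            # relaxed starting point
    best_b, best_val = np.zeros(N), -np.inf
    for _ in range(n_iter):
        grad = s - 2.0 * P @ x                     # gradient of f at the current point
        b = (grad > 0).astype(float)               # binary maximizer of the linearization
        val = s @ b - b @ P @ b
        if val > best_val:
            best_b, best_val = b, val
        d = b - x
        slope, curv = grad @ d, d @ P @ d
        if slope <= 1e-12:
            break                                  # stationary: no ascent direction left
        t = 1.0 if curv <= 0 else min(1.0, slope / (2.0 * curv))
        x = x + t * d                              # line search along the ascent direction
    return best_b.astype(bool), best_val

# Example: N candidate detections with unary terms A_i + C_li and pairwise penalties P_ij.
rng = np.random.default_rng(3)
N = 8
A = rng.uniform(0, 1, N)                           # appearance term A_i
C_l = rng.uniform(0, 1, N)                         # class consistency term C_li for class l
P = rng.uniform(0, 0.3, (N, N)); P = (P + P.T) / 2; np.fill_diagonal(P, 0)
selected, value = ipfp_select(A + C_l, P)
print("selected detections:", np.flatnonzero(selected), "objective:", round(value, 3))
```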
  • Slide 21
  • 4. Experimental Evaluation. Datasets: UCF50 and Olympic Sports. Implementation details: our current implementation considers only cuboid patches; patches are represented with HOG3D features (4x4x5 cells with 20 discrete orientations). Classification results.
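As a quick sanity check on the stated HOG3D layout, the descriptor length implied by 4x4x5 cells with 20 discrete orientations is 1600 dimensions, assuming the per-cell histograms are simply concatenated:

```python
cells_x, cells_y, cells_t, orientations = 4, 4, 5, 20
print(cells_x * cells_y * cells_t * orientations)  # 1600-dimensional HOG3D descriptor
```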
  • Slide 22
  • 4. Experimental Evaluation
  • Slide 23
  • Slide 24
  • Correspondence and Label Transfer
  • Slide 25
  • 4. Experimental Evaluation
  • Slide 26
  • Conclusion: (1) a new representation for videos based on discriminative spatio-temporal patches; (2) these patches are mined automatically using an exemplar-based clustering approach; (3) strong correspondences are obtained to align videos for transferring annotations; (4) used as a vocabulary, the patches achieve state-of-the-art results for action classification.