Learning realistic human actions from movies


Transcript of Learning realistic human actions from movies


Learning realistic human actions from movies
By Abhinandh Palicherla, Divya Akuthota, Samish Chandra Kolli

Introduction
We address the recognition of natural human actions in diverse and realistic video settings, and the limitations of current work, namely the lack of realistic and annotated video datasets. Visual recognition has progressed from classifying toy objects towards recognizing classes of objects and scenes in natural images.

Existing datasets for human action recognition provide samples for only a few action classes.

This progress in visual recognition is now being transferred to the domain of video recognition and human action recognition.

The existing datasets for human actions are limited and recorded under controlled, simplified settings. Real applications, however, demand natural video with human actions subject to the individual variations of people in expression, posture, motion and clothing.

In order to address these limitations of the current datasets and collect realistic video samples with human actions, we implement:

Automatic annotation of human actions (manual annotation is difficult)

Video classification for action recognition

These two components, automatic annotation of human actions and video classification for action recognition, make up the proposed approach.

Action classification in video faces problems similar to those of object recognition in static images. For static images, however, these problems are handled well by bag-of-features representations combined with machine learning techniques such as support vector machines.

How to generalise these results to the recognition of realistic human actions is still an open question.

Hence, alongside collecting realistic human action datasets, we employ spatio-temporal features and generalise spatial pyramids to the spatio-temporal domain. We use only 8 action classes from movies.

Automatic annotation of human actions
Alignment of actions in scripts and videos
Text retrieval of human actions
Video datasets for human actions

Alignment of actions in scripts and videos

Subtitles (with time information):

1172
01:20:17,240 --> 01:20:20,437
Why weren't you honest with me?
Why'd you keep your marriage a secret?

1173
01:20:20,640 --> 01:20:23,598
It wasn't my secret, Richard.
Victor wanted it that way.

1174
01:20:23,800 --> 01:20:26,189
Not even our closest friends
knew about our marriage.

Movie script (no time information):

RICK: Why weren't you honest with me? Why did you keep your marriage a secret?
Rick sits down with Ilsa.
ILSA: Oh, it wasn't my secret, Richard. Victor wanted it that way. Not even our closest friends knew about our marriage.

The script passage corresponds to the subtitle interval 01:20:17 to 01:20:23.
Scripts are available for >500 movies (no time synchronization): www.dailyscript.com, www.movie-page.com, www.weeklyscript.com
Subtitles (with time information) are available for most movies.
Time can be transferred to scripts by text alignment, as sketched below.
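To make the time-transfer step concrete, here is a minimal sketch of matching script dialogue to timed subtitles and transferring the subtitle interval to the script line. It is an assumption-laden stand-in (a simple greedy nearest-text match rather than the alignment actually used by the authors), and parse_srt / align_script_to_subtitles are hypothetical helper names.

    import re
    from difflib import SequenceMatcher

    def parse_srt(path):
        # Parse an .srt subtitle file into (start, end, text) entries (simplified).
        entries = []
        for block in open(path, encoding="utf-8").read().strip().split("\n\n"):
            lines = block.splitlines()
            if len(lines) < 3:
                continue
            m = re.match(r"(\d\d:\d\d:\d\d),\d+ --> (\d\d:\d\d:\d\d),\d+", lines[1])
            if m:
                entries.append((m.group(1), m.group(2), " ".join(lines[2:]).lower()))
        return entries

    def align_script_to_subtitles(script_dialogues, subtitles, min_sim=0.5):
        # For each script dialogue line, find the most similar subtitle text and
        # transfer its [start, end] interval to the script; keep confident matches.
        aligned = []
        for line in script_dialogues:
            best = max(subtitles,
                       key=lambda s: SequenceMatcher(None, line.lower(), s[2]).ratio())
            sim = SequenceMatcher(None, line.lower(), best[2]).ratio()
            if sim >= min_sim:
                aligned.append((line, best[0], best[1], sim))
        return aligned

Scene descriptions between two aligned dialogue lines (e.g. "Rick sits down with Ilsa.") would then inherit the time interval spanned by their neighbouring matches.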

Script alignment: Evaluation
Annotate action samples in text.
Do automatic script-to-video alignment.
Check the correspondence of actions in scripts and movies.
Example of a visual false positive: "A black car pulls up, two army officers get out."
(a: quality of subtitle-script matching)

Text retrieval of human actions
There is a large variation of action expressions in text. For the GetOutCar action:
"Will gets out of the Chevrolet."
"Erin exits her new truck."
Potential false positives:
"About to sit down, he freezes."
=> Supervised text classification approach (a small sketch follows below).
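As an illustration of such a text classifier, the sketch below trains a bag-of-words linear classifier on a few toy sentences labelled as containing the GetOutCar action or not. The slide only states "supervised text classification approach"; the scikit-learn pipeline, the tiny training set and the model choice are assumptions for illustration.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Hypothetical toy data: script sentences labelled as GetOutCar (1) or not (0).
    sentences = [
        "Will gets out of the Chevrolet.",
        "Erin exits her new truck.",
        "About to sit down, he freezes.",
        "She walks slowly across the room.",
    ]
    labels = [1, 1, 0, 0]

    # Bag-of-words features (unigrams + bigrams) feeding a linear classifier.
    clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LinearSVC())
    clf.fit(sentences, labels)

    print(clf.predict(["He climbs out of the car and slams the door."]))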

Video Datasets for Human actions

Training data from 12 movies; test data from 20 different movies.
Learn a vision-based classifier from the automatic training set.
Compare its performance to a classifier learned from the manual training set.

Video classification for action recognition

Space-time features
Good performance for action recognition.
Compact, and tolerant to background clutter, occlusions and scale changes.

Interest point detection
Harris operator with a space-time extension.
We use multiple levels of spatio-temporal scales: σ = 2^((1+i)/2), i = 1, ..., 6 and τ = 2^(j/2), j = 1, 2 (a small sketch of these scale levels follows the reference below).

I. Laptev. On space-time interest points. IJCV, 64(2/3):107-123, 2005.
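As a quick illustration, the snippet below just enumerates the scale levels listed on the slide; the variable names and the pairing of every spatial scale with every temporal scale are assumptions.

    # Spatio-temporal scale levels for the multi-scale space-time Harris detector
    # (numeric scheme from the slide; variable names are illustrative).
    spatial_scales  = [2 ** ((1 + i) / 2) for i in range(1, 7)]   # i = 1..6
    temporal_scales = [2 ** (j / 2) for j in range(1, 3)]         # j = 1, 2

    # Every (spatial, temporal) pair defines one detection scale.
    scale_pairs = [(s, t) for s in spatial_scales for t in temporal_scales]
    print(len(scale_pairs), "scale combinations")   # 12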


Descriptors
Compute histogram descriptors of the space-time volume around each interest point.
The volume size (Δx, Δy, Δt) is related to the detection scales by Δx, Δy = 2kσ and Δt = 2kτ.
Each volume is divided into an (nx, ny, nt) grid of cuboids.
We use k = 9, nx = ny = 3, nt = 2.

For each cuboid, we calculate HoG (histograms of oriented gradients) and HoF (histograms of optic flow) descriptors.
These are very similar to SIFT descriptors, adapted to the third dimension.

Spatio-temporal BoF
Construct a visual vocabulary using k-means, with k = 4000 (just like what we do in hw3).
Assign each feature to one word.
Compute a frequency histogram for the entire video, or for a subsequence defined by a spatio-temporal grid.
If divided into grids, concatenate and normalize the cell histograms.

Grids
We divide both the spatial and temporal dimensions.
Spatial: 1x1, 2x2, 3x3, v1x3, h3x1, o2x2
Temporal: t1, t2, t3, ot2
6 * 4 = 24 possible grid combinations.
Descriptor + grid = channel (a small sketch of the gridded histograms follows below).
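The sketch below shows how a gridded bag-of-features histogram could be assembled once features have been detected and assigned to visual words; the function and parameter names are illustrative, not from the paper.

    import numpy as np

    def bof_histogram(points, words, vocab_size, video_shape, grid=(1, 1, 1)):
        # points: (x, y, t) feature locations; words: visual-word index per feature.
        # video_shape: (width, height, n_frames); grid: (nx, ny, nt) as on the slide.
        nx, ny, nt = grid
        w, h, T = video_shape
        hist = np.zeros((nx, ny, nt, vocab_size))
        for (x, y, t), word in zip(points, words):
            cx = min(int(nx * x / w), nx - 1)   # spatial cell along x
            cy = min(int(ny * y / h), ny - 1)   # spatial cell along y
            ct = min(int(nt * t / T), nt - 1)   # temporal cell
            hist[cx, cy, ct, word] += 1
        hist = hist.reshape(-1)                  # concatenate the cell histograms
        return hist / max(hist.sum(), 1)         # normalize

One such histogram per (descriptor, grid) pair gives one channel, e.g. HoG with the 2x2 spatial grid and the t2 temporal grid.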

Non-linear SVM
Classification uses a non-linear SVM with a multi-channel Gaussian kernel.

V = vocabulary size; A = mean distance between training samples.
The best set of channels for a training set is found by a greedy approach.
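The kernel formula itself does not appear in the exported text; as a hedged reconstruction from the underlying paper (Laptev et al., CVPR 2008), the multi-channel kernel combines per-channel chi-square distances:

    % Reconstruction, not copied from the slide: multi-channel Gaussian kernel
    % over channel histograms H_i = {h_in}; D_c is the chi-square distance for
    % channel c and A_c its mean value over the training samples.
    K(H_i, H_j) = \exp\Big( -\sum_{c \in C} \frac{1}{A_c}\, D_c(H_i, H_j) \Big),
    \qquad
    D_c(H_i, H_j) = \frac{1}{2} \sum_{n=1}^{V} \frac{(h_{in} - h_{jn})^2}{h_{in} + h_{jn}}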

What channels to use?
Channels may complement each other.
A greedy approach is used to pick the best combination (a small sketch follows below).
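One way to read "greedy approach" is simple forward selection over channels; the sketch below is an assumption in that spirit (the authors' exact procedure may also consider removing channels), and the evaluate callback, e.g. validation accuracy of the multi-channel SVM, is hypothetical.

    def greedy_channel_selection(channels, evaluate):
        # channels: list of (descriptor, grid) channel identifiers.
        # evaluate(subset): hypothetical callback returning validation accuracy
        # for a given channel subset (e.g. via the multi-channel kernel SVM).
        selected, best_score = [], float("-inf")
        improved = True
        while improved:
            improved = False
            for c in (c for c in channels if c not in selected):
                score = evaluate(selected + [c])
                if score > best_score:
                    best_score, best_candidate = score, c
                    improved = True
            if improved:
                selected.append(best_candidate)
        return selected, best_score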

Combining channels is more advantageous.
Table: Classification performance of different channels and their combinations.

Evaluation of spatio-temporal grids

Figure: Number of occurrences of each channel component within the optimized channel combinations, for the KTH action dataset and our manually labelled movie dataset.

Results with the KTH dataset

Figure: Sample frames from the KTH action sequences; all six classes (columns) and scenarios (rows) are presented.
The dataset contains six types of human actions, namely walking, jogging, running, boxing, hand waving and hand clapping, performed several times by 25 subjects. The sequences were taken in four different scenarios: outdoors, outdoors with scale variation, outdoors with different clothes, and indoors.
2391 sequences, divided into a training/validation set (8+8 people) and a test set (9 people).
10-fold cross-validation.

Table: Confusion matrix for the KTH actions.

Robustness to noise in the training data
Up to p = 0.2 the performance decreases insignificantly.
At p = 0.4 the performance decreases by around 10%.

Figure: Performance of our video classification approach in the presence of wrong labels.
We can, therefore, predict very good performance for the proposed automatic training scenario, where the observed amount of wrong labels is around 40%.

Action recognition in real-world videos

Table: Average precision (AP) for each action class of our test set, comparing clean (annotated) and automatic training data, plus results for a random classifier (chance).

Figure: Example results for action classification trained on the automatically annotated data. We show the key frames of test movies with the highest confidence values for true/false positives/negatives.
The rapid getting up is typical for GetOutCar.
The false negatives are very difficult to recognize, e.g. an occluded handshake or a hardly visible person getting out of the car.

Conclusions

Summary
Automatic generation of realistic action samples.
Transfer of recent bag-of-features experience to videos.
Improved performance on the KTH benchmark.
Decent results for actions in real-world videos.

Future directions
Improving the script-video alignment.
Experimenting with space-time low-level features.
Internet-scale video search.

Thank you