Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra...

28
Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli

Transcript of Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra...

Page 1: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli.

Learning realistic human actions from movies

By Abhinandh PalicherlaDivya AkuthotaSamish Chandra Kolli

Page 2: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli.

Introduction

Address recognition of natural human actions in diverse and realistic video settings.

Addresses the limitations (lack of realistic and annotated video datasets)

Page 3: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli.

Visual recognition progressed from classifying toy objects towards recognizing the classes of objects and scenes in natural images .

Existing datasets for human action recognition provide samples for few action classes .

Page 4: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli.

To Address these limitations we implement

• Automatic annotation of human actions • Manual annotation is difficult

• Video classification for action recognition

Page 5: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli.

Automatic annotation of human actions

Alignment of actions in scripts and videosText Retrieval of human actionsVideo datasets for human actions

Page 6: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli.

Alignment of actions in scripts and videos

…117201:20:17,240 --> 01:20:20,437

Why weren't you honest with me?Why'd you keep your marriage a secret?

117301:20:20,640 --> 01:20:23,598

lt wasn't my secret, Richard.Victor wanted it that way.

117401:20:23,800 --> 01:20:26,189

Not even our closest friendsknew about our marriage.…

…RICK

Why weren't you honest with me? Why did you keep your marriage a secret?

Rick sits down with Ilsa.

ILSA

Oh, it wasn't my secret, Richard. Victor wanted it that way. Not even our closest friends knew about our marriage.

01:20:17

01:20:23

subtitlesmovie script

• Scripts available for >500 movies (no time synchronization) www.dailyscript.com, www.movie-page.com,

www.weeklyscript.com …

• Subtitles (with time info.) are available for the most of movies• Can transfer time to scripts by text alignment

Page 7: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli.

Script alignment: Evaluation

Example of a “visual false positive”

A black car pulls up, two army officers get out.

• Annotate action samples in text• Do automatic script-to-video alignment• Check the correspondence of actions in scripts and movies

a: quality of subtitle-script matching

Page 8: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli.

Text Retrieval of human actions

“… Will gets out of the Chevrolet. …” “… Erin exits her new truck…”

• Large variation of action expressions in text:

GetOutCar action:

Potential false positives: “…About to sit down, he freezes…”

• => Supervised text classification approach

Page 9: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli.

Video Datasets for Human actions 1

2 m

ovie

s2

0

diff

ere

nt

movie

s

• Learn vision-based classifier from automatic training set• Compare performance to the manual training set

Page 10: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli.

Video Classification for action recognition

Page 11: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli.

SPACE-TIME FEATURES

Good performance for action recognition Compact and provide tolerance to

background clutter, occlusions and scale changes.

Page 12: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli.

INTEREST POINT DETECTION

Harris operator - with a space-time extension.

We use multiple levels of spatio-temporal scales

σ = 2(1+i)/2 , i = 1, …, 6τ = 2j/2 , j = 1, 2

I. Laptev. On space-time interest points. IJCV, 64(2/3):107–123, 2005.

Page 13: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli.
Page 14: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli.

DESCRIPTORS Compute histogram descriptors of volume

around the interest points.(∆x , ∆y , ∆t ) is related to the detection scales by ∆x , ∆y =

2kσ, ∆t = 2kτ

Each volume is divided into (nx, ny, nt) grid of cuboids.

We use k = 9, nx, ny=3, nt=2.

Page 15: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli.

..CONTD

For each cuboid, we calculate HoG and HoF (optic flow) descriptors

Very similar to SIFT descriptors, adapted to the third dimension.

Page 16: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli.

SPATIO-TEMPORAL BOF

Construct a visual vocabulary using k-means, with k = 4000. (Just like what we do in hw3)

Assign each feature to one word. Compute a frequency histogram for the

entire video, Or, a subsequence defined by a spatio-temporal grid.

If divided into grids, concatenate and normalize.

Page 17: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli.

SPATIO-TEMPORAL BOF

Construct a visual vocabulary using k-means, with k = 4000. (Just like what we do in hw3)

Assign each feature to one word. Compute a frequency histogram for the

entire video, Or, a subsequence defined by a spatio-temporal grid.

If divided into grids, concatenate and normalize.

Page 18: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli.

GRIDS

We divide both spatial and temporal dimensions.

Spatial – 1x1, 2x2, 3x3, v1x3, h3x1, o2x2 Temporal – t1, t2, t3, ot2 6 * 4 = 24 possible grid combinations! Descriptor + grid = channel.

Page 19: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli.

NON-LINEAR SVM Classification using a non-linear SVM Multi-channel Gaussian kernel

V = vocab size, A = mean distances between training samples

Best set of channels for a training set is found by a greedy approach.

Page 20: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli.

WHAT CHANNELS TO USE?

Channels may complement each other Greedy approach to pick the best

combination

Combining channels is more advantageousTable: Classification performance of different channels and their combinations

Page 21: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli.

EVALUATION OF SPATIO-TEMPORAL GRIDS

Figure: Number of occurrences for each channel component within the optimized channel combinations for the KTH action dataset and our manually labeled movie dataset

Page 22: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli.

RESULTS WITH THE KTH DATASET

Figure: Sample frames from the KTH actions sequences, all six classes (columns) and scenarios (rows) are presented

Page 23: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli.

• 2391 sequences divided into the training/validation set (8+8 people) and test set (9 people)

• 10 fold cross validation

Table: Confusion matrix for the KTH actions

RESULTS WITH THE KTH DATASET

Page 24: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli.

ROBUSTNESS TO NOISE IN THE TRAINING DATA

Up to p=0.2 the performance decreases insignificantly

At p=0.4 the performance decreases by around 10%

Figure: Performance of our video classification approach in the presence of wrong labels

Page 25: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli.

ACTION RECOGNITION IN REAL-WORD VIDEOS

Table: Average precision (AP) for each action class of our test set. Comparison results for clean (annotated) and automatic training data and also results for a random classifier (chance)

Page 26: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli.

ACTION RECOGNITION IN REAL-WORLD VIDEOS

Figure: Example results for action classification trained on the automatically annotated data. We show the key frames for test movies with the highest confidence values for true/false; pos/neg

• the rapid getting up is typical for “GetOutCar”• the false negatives are very difficult to recognize

• occluded handshake• hardly visible person getting out of the car

Page 27: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli.

CONCLUSIONS

Summary Automatic generation of realistic action samples Transfer of recent bag-of-features experience to

videos Improved performance on KTH benchmark Decent results for actions in real-videos

Future direction Improving the script-video alignment Experimenting with space-time-low-level-features Internet-scale video search

Page 28: Learning realistic human actions from movies By Abhinandh Palicherla Divya Akuthota Samish Chandra Kolli.

THANK YOU