Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features...

35
Action Recognition in Video by Sparse Representation on Covariance Manifolds of Silhouette Tunnels Kai Guo, Prakash Ishwar, and Janusz Konrad Department of Electrical & Computer Engineering

Transcript of Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features...

Page 1: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

Action Recognition in Video by Sparse Representationon Covariance Manifolds of Silhouette Tunnels

Kai Guo, Prakash Ishwar, and Janusz Konrad

Department of Electrical & Computer Engineering

Page 2: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

Motivation

out of 352

• Recognize actions in video

• Applications

Surveillance: Sports & entertainment:

Wildlife habitatmonitoring:

Tools for hearing impaired:

skate

walk

Page 3: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

Challenges and assumptions

out of 353

Scene• Multiple objects • Clutter and occlusion• Illumination variability

Page 4: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

Challenges and assumptions

out of 354

Scene• Multiple objects • Clutter and occlusion• Illumination variability

Scene• Single object • No significant clutter and occlusion• No significant illumination change

Page 5: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

Challenges and assumptions

out of 355

Scene• Multiple objects • Clutter and occlusion• Illumination variability

Acquisition• Camera motion • Camera viewpoint and zoom• Camera imperfections

Scene• Single object • No significant clutter and occlusion• No significant illumination change

Page 6: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

Challenges and assumptions

out of 356

Scene• Multiple objects • Clutter and occlusion• Illumination variability

Acquisition• Camera motion • Camera viewpoint and zoom• Camera imperfections

Scene• Single object • No significant clutter and occlusion• No significant illumination change

Acquisition• Single camera, fixed viewpoint• No significant camera motion

and distortion

Page 7: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

Challenges and assumptions

out of 357

Scene• Multiple objects • Clutter and occlusion• Illumination variability

Acquisition• Camera motion • Camera viewpoint and zoom• Camera imperfections

Scene• Single object • No significant clutter and occlusion• No significant illumination change

Acquisition• Single camera, fixed viewpoint• No significant camera motion

and distortion

Action• Non-rigid objects• Complex motion (ballet video)• Intra and inter object motion

variability

Page 8: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

Challenges and assumptions

out of 358

Scene• Multiple objects • Clutter and occlusion• Illumination variability

Acquisition• Camera motion • Camera viewpoint and zoom• Camera imperfections

Scene• Single object • No significant clutter and occlusion• No significant illumination change

Acquisition• Single camera, fixed viewpoint• No significant camera motion

and distortion

Action• Non-rigid objects• Complex motion (ballet video)• Intra and inter object motion

variability

Action• Non-rigid objects• Complex motion• Intra and inter object motion

variability

Page 9: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

Problem statement

out of 359

Given:

Task: Action Recognition

Dictionary of labeled training data

walk run jump wave

unknown action

walk

Recall: challenges– non-rigid object– complex motion– intra and inter object motion variability

Page 10: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

Related work

out of 3510

Shape-basedfeatures

Interest -point based features

Geometric human body

features

Motion-based features

NearestNeighbor

[Gorelick-Irani-PAMI’07][Bobick-Davis-PAMI’01]

[Collins-Gross-ICAFGR’02]

[Dollar-Rabaud-VS PETS’05]

[Cunado-Nixon-CVIU’03]

[Wang-Ning-ICSVT’04]

[Seo-Milanfar-PAMI (submitted)],

[Liu-Ali-CVPR’08][Lowe-IJCV’04]

SVM[Ikizler-Duygulu-

LNCS’07][Ahmad-Lee-Journal of

Multimedia’10]

[Shuldt-Laptev-ICPR’04]

[Laptev-CVPR’08]

[Goncalves-Bernardo-CVPR’95]

[Danafar-Gheissari-ACCV07], [Scovanner-

Ali-ACMMultimedia’07]

Boosting [Zhang-Liu-ICCV’09] [Smith-Shah-ICCV’05] -

[Alireza-Mori-CVPR’08],

[Ke-Sukthankar-ICCV’05]

Graphical (Probabilistic)

model[Chen-Wu-ICDMW’08]

[Niebles-Lei-IJCV’08][Wong-Cipolla-

ICCV’07] [Rohr-CVGIP’94] [Ali-Shah-PAMI’10]

Action features

Actio

n c

lass

ifier

s

Page 11: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

Action recognition framework

out of 3511

Action recognition = Supervised learning problem,

where data samples are video clips

Two main ingredients:

— Representation of samples

— Classification

Page 12: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

Action representation

out of 3512

video clip bag of dense local featureslocal features

Page 13: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

Action representation

out of 3513

• How to reduce the dimension of bag of local features?

video clip bag of dense local featureslocal features

Page 14: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

Action representation

out of 3514

• How to reduce the dimension of bag of local features?

video clip

Action representation

pdf

— Ideally, one should learn and compare pdfs of features— Problem: it may not reduce the dimensionality

bag of dense local featureslocal features

Page 15: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

Action representation

out of 3515

• How to reduce the dimension of bag of local features?

video clip

Action representation

— Idea-1: Learn and compare 1st order statistics (mean)— Problem: not sufficiently discriminative

Meanbag of dense local featureslocal features

Page 16: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

Action representation

out of 3516

• How to reduce the dimension of bag of local features?

video clip covariance matrix

Action representation

— Idea-2: Learn and compare 2nd order statistics (covariance)[Tuzel-Porikli-Meer PAMI’08]

bag of dense local featureslocal features

Page 17: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

Action representation

out of 3517

• How to reduce the dimension of bag of local features?

video clip covariance matrix

ΨAction representation

— Idea-2: Learn and compare 2nd order statistics (covariance)

— Output: feature covariance matrix (e.g., 13-dim vector -> 91-dim covariance matrix)

[Tuzel-Porikli-Meer PAMI’08]

bag of dense local featureslocal features

Page 18: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

Action representation

out of 3518

• How to reduce the dimension of bag of local features?

video clip covariance matrix

ΨAction representation

— Idea-2: Learn and compare 2nd order statistics (covariance)

Main thesis: covariance matrix is “sufficient” for action recognition

— Output: feature covariance matrix (e.g., 13-dim vector -> 91-dim covariance matrix)

[Tuzel-Porikli-Meer PAMI’08]

bag of dense local featureslocal features

Page 19: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

Action representation

out of 3519

• How to reduce the dimension of bag of local features?

video clip covariance matrix

ΨAction representation

— Idea-2: Learn and compare 2nd order statistics (covariance)

Main thesis: covariance matrix is “sufficient” for action recognition

— Output: feature covariance matrix (e.g., 13-dim vector -> 91-dim covariance matrix)

[Tuzel-Porikli-Meer PAMI’08]

silhouette tunnel bag of dense local features

Page 20: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

Covariance manifold

• Covariance matrices form:— a Riemannian manifold— not a vector space

• Matrix-log maps a Riemannian manifoldto a vector space

——

out of 3520

CC’

Manifold ofcovariance matrices

matrix-log

log(C)

log(C’)

TUDUC =TUDUC )log(:)log( =

vector space

[Arsigny-Pennec-Ayache’06]

Page 21: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

Classification based on sparse linear representation

out of 3521

seg1

seg2

segK

label(1)

label(2)

label(N)

Ψ

C1

C2

CN

Cquery

Matrix-log&

vectorization

Matrix-log&

vectorization

Matrix-log&

vectorization

Matrix-log&

vectorization

pquery

P

...

Decide query label

Estimated query label

Ψ

Ψ

Ψ

pN

p2

p1

[Wright et al., PAMI’09]

Page 22: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

Classification algorithm

Each coefficient of weights the contribution of training segments to query segment

out of 3522

Step 1: Compute residual error:

Step 2: Determine query label

2* ||||)( iiqueryqueryi PR α−= pp

)(minarg)( queryiiquery Rlabel pp =

action 1 action 2 action 3

P

P1 P2 P3

*1α

*2α

*3α

Page 23: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

Silhouette-based local features

out of 3523

dNE

dE

dSE

dS

dSW

dWdNW

dN

dT+

dT-

f(s) =

xyt

dN

dE…dT+

dT- 13x1

s = yt

x

(x,y,t)

• Dimensionality reductionT

SS S

C ))()()((||

1: μμ −−= ∑∈

sfsfs

Page 24: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

Implementation issues• How to process a long video?

out of 3524

— Break it into segments

• What should be the length of segments??

timex

y—Period for human action ≈ 0.4 — 0.8 s

—Segment length ≈ 10 — 20 frames

(@25 fps video)

Page 25: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

Action segments

out of 3525

• No knowledge about the beginning and end of periods— Use overlapping segments

• Additional benefits of overlapping segments— Reduced sensitivity to temporal action misalignment— Richer dictionary

Page 26: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

Segment-level classification

out of 3526

seg1

seg2

segK

label(1)

label(2)

label(N)

Ψ

C1

C2

CN

Cquery

Matrix-log&

vectorization

Matrix-log&

vectorization

Matrix-log&

vectorization

Matrix-log&

vectorization

pquery

P

...

Decide query label

Estimated query label

Ψ

Ψ

Ψ

pN

p2

p1

[Wright et al., PAMI’09]

Page 27: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

From segment decisions to a sequence decision

out of 3527

walk

• How to get the labels of query video ? — Majority rule

Page 28: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

Summary of the approach• Partitioning into segments

28 out of 35

Page 29: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

Summary of the approach• Partitioning into segments• Action representation for each segment

video clip covariance matrix

ΨAction representation

silhouette tunnel bag of dense local features

29 out of 35

Page 30: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

Summary of the approach• Partitioning into segments

• Segment-wise action recognition• Action representation for each segment

Cquery

C1 C2 CK

Bag of labeled feature covariance matrices

label(1) label(2) label(K)

Action recognition Estimated label of query segment

30 out of 35

Page 31: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

Summary of the approach• Partitioning into segments

• Segment-wise action recognition

• Decision fusion

walk

• Action representation for each segment

31 out of 35

Page 32: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

Experimental results

out of 3532

• Datasets— Weizmann: 9 persons x 10 actions (180x144)

— UT-Tower: 6 persons x 9 actions x 2 times (about 90x70)

wave1 wave2

run

side skip walk

pjumpjumpjumping-jackbend

point stand dig walk carry

runwave1 wave2jump

Page 33: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

out of 3533

Performances

• Weizmann dataset: (LOOCV, N=8)

Method Proposed NN-based Gorelick Niebles Ali Seo

SEG-CCR 96.74% 97.05% 97.83% — 95.75% —

SEQ-CCR 100% 100% — 90% — 96%

• Correct classification rate (CCR)— SEG-CCR: % of correctly classified query segments — SEQ-CCR: % of correctly classified query sequences

Method Proposed NN-based

SEG-CCR 96.15% 93.53%

SEQ-CCR 97.22% 96.30%

• UT-Tower dataset: (LOOCV, N=8)

Page 34: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

out of 3534

Computational complexity

• Platform: Dual Core 2.2 GHz + 2GB Memory + Matlab 7.6

• Action representation

video: 180 x 144 x 84 — 0.12 sec/frame (8.3fps)

• Action classification

0.07 sec/segment (14.3fps)

Page 35: Action Recognition in Video by Sparse Representation on ...[Ali-Shah-PAMI’10] Action features Action classifiers. Action recognition framework 11 out of 35 ...

Conclusions

• We proposed a novel approach to action recognition:

—action representation = covariance matrix of local features—action classification = sparse-representation-based classifier

• The proposed approach has

— state-of-the-art performance on Weizmann dataset— 100% performance on non-static actions in low-resolution

UT-Tower dataset— low memory requirements with close to real-time

performance

out of 3535