CROSS-VIEW ACTION RECOGNITION FROM TEMPORAL SELF-SIMILARITIES

Imran Junejo, Emilie Dexter, Ivan Laptev, and Patrick Pérez
IRISA / INRIA Rennes, Campus universitaire de Beaulieu, 35042 Rennes Cedex, France

Action Recognition Results

CMU MoCap dataset (multi-view)
• Projected tracks for 13 joints, 12 action classes.
• Simulated noise in joint tracking; six virtual cameras.

Recognition accuracy (%), training views (rows) vs. test views (columns):

           cam1   cam2   cam3   cam4   cam5   cam6    All
  cam1     92.1   89.0   76.2   71.3   73.2   84.8   81.1
  cam2     87.2   92.7   83.5   72.6   64.6   78.7   79.9
  cam3     78.7   83.5   89.0   90.9   67.7   61.0   78.5
  cam4     78.0   75.6   88.4   90.9   72.6   63.4   78.2
  cam5     81.1   73.8   76.8   83.5   95.7   80.5   81.9
  cam6     86.0   88.4   73.8   76.2   78.0   91.5   82.3
  All      90.9   90.2   87.8   90.9   92.7   90.9   90.5

Per-class recognition accuracy (%), over all views:

  bend         80.6
  cartwheels   95.2
  drink         0.0
  fjump        96.8
  flystroke    29.2
  golf        100.0
  jjack        97.2
  jump         64.6
  kick         96.3
  run         100.0
  walk         99.6
  walkturn     68.8
  All          90.5
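As an illustration of this setup, the sketch below projects 3D MoCap joint tracks through a virtual pinhole camera and adds Gaussian noise to simulate joint-tracking errors. The function name, the camera parametrization (R, t, f), and the noise level sigma are our assumptions; the poster does not specify them.

```python
import numpy as np

def project_with_noise(joints3d, R, t, f=1.0, sigma=0.5):
    """Project 3D MoCap joints into a virtual camera and perturb them
    with Gaussian noise to simulate imperfect joint tracking.

    joints3d: (T, P, 3) joint positions over T frames for P joints.
    R (3, 3), t (3,): camera rotation and translation; f: focal length.
    Returns (T, P, 2) noisy 2D tracks.
    """
    cam = joints3d @ R.T + t                  # world -> camera coordinates
    uv = f * cam[..., :2] / cam[..., 2:3]     # pinhole projection
    return uv + np.random.normal(0.0, sigma, uv.shape)
```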

Weizmann dataset (single view)
• 9 action classes, performed by 9 actors.
• NNC recognition accuracy 95.3% with SSM-pos and 94.6% with SSM-of-ofx-ofy-hog, compared to 92.6% [Ali et al. ICCV'07].

IXMAS dataset (multi-view)
• 5 different views, 11 action classes, performed 3 times by 10 actors.

[Figure: example frames of the "check watch" and "punch" actions seen from cameras 1–5.]

View Invariance Properties

• SSM "images" are stable under view changes.
• Experimental validation with dense view sampling:
  – Compute the gradient orientation at each SSM point.
  – Estimate the per-point orientation variance over views (see the sketch below).
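A minimal sketch of this validation, assuming the SSMs of one action rendered from V sampled views are stacked into a single array; treating the mod-180° orientation ambiguity with circular statistics on the doubled angle is our choice, not a detail stated on the poster.

```python
import numpy as np

def orientation_std_over_views(ssms):
    """Per-point spread of SSM gradient orientation across views.

    ssms: (V, T, T) array of SSMs of one action from V sampled views.
    Returns a (T, T) map of circular standard deviations in degrees;
    small values indicate view-stable SSM structure.
    """
    gy, gx = np.gradient(ssms, axis=(1, 2))   # per-view SSM gradients
    theta = np.arctan2(gy, gx)                # (V, T, T) orientations
    # Circular statistics on 2*theta: orientation is defined mod 180 deg.
    R = np.abs(np.mean(np.exp(2j * theta), axis=0))
    std = np.sqrt(-2.0 * np.log(np.clip(R, 1e-12, 1.0))) / 2.0
    return np.degrees(std)
```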

[Figure: per-point standard deviation of SSM gradient orientation over densely sampled views, for the "bend" action (average std: 17.4°) and the "kick" action (average std: 20.93°).]

SSM-based Descriptor

• Local patch-based descriptor centered at each point $i$ on the diagonal.
• Compute an 8-bin histogram of SSM gradients for each of the 11 blocks $h_i^m$.
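A rough sketch of such a local descriptor; the poster's exact 11-block layout is not reproduced here, so as a stated simplification this version uses the four quadrants of the patch plus the whole patch (5 blocks × 8 bins = 40 dimensions).

```python
import numpy as np

def grad_orientation_hist(gx, gy, n_bins=8):
    """8-bin, magnitude-weighted histogram of gradient orientations."""
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)
    mag = np.hypot(gx, gy)
    hist, _ = np.histogram(ang, bins=n_bins, range=(0, 2 * np.pi),
                           weights=mag)
    return hist

def ssm_descriptor(M, t, r=14):
    """Local patch descriptor centered at diagonal point (t, t) of SSM M;
    requires r <= t <= T - r. Simplified block layout (see note above)."""
    gy, gx = np.gradient(M[t - r:t + r, t - r:t + r])
    halves = [slice(0, r), slice(r, 2 * r)]
    blocks = [(a, b) for a in halves for b in halves]   # four quadrants
    blocks.append((slice(0, 2 * r), slice(0, 2 * r)))   # whole patch
    d = np.concatenate([grad_orientation_hist(gx[a, b], gy[a, b])
                        for a, b in blocks])
    return d / (np.linalg.norm(d) + 1e-12)              # L2-normalize
```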

Action recognition: we represent each video as a bag of local SSM descriptors $H = (h_1, \ldots, h_n)$ and apply either a Nearest Neighbor Classifier (BoF-NNC) or a Support Vector Machine (BoF-SVM).
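The bag-of-features step could look as follows: local descriptors are quantized against a k-means vocabulary and each video becomes a normalized word histogram, on which an SVM (or a nearest-neighbor classifier) is trained. The vocabulary size, the SVM kernel, and the placeholder names train_descriptor_sets and train_labels are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def bof_histogram(descriptors, vocab):
    """Quantize one video's local SSM descriptors against a visual
    vocabulary; return its normalized bag-of-features histogram."""
    words = vocab.predict(descriptors)
    h = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return h / h.sum()

# Usage sketch (train_descriptor_sets / train_labels are placeholders):
# vocab = KMeans(n_clusters=100).fit(np.vstack(train_descriptor_sets))
# X = np.array([bof_histogram(d, vocab) for d in train_descriptor_sets])
# clf = SVC(kernel="rbf").fit(X, train_labels)   # the BoF-SVM variant
```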

Temporal Multi-View Video Alignment

• Represent each SSM by a sequence of local SSM descriptors, $H_1$ and $H_2$.
• Align $H_1$ and $H_2$ with Dynamic Time Warping (sketched below).
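A standard DTW sketch; the local cost between descriptors is not stated on the poster, so the Euclidean distance used here is an assumption.

```python
import numpy as np

def dtw_align(H1, H2):
    """Align two descriptor sequences H1 (T1, D) and H2 (T2, D) with
    dynamic time warping; returns the warping path as (i, j) pairs."""
    T1, T2 = len(H1), len(H2)
    # pairwise descriptor distances
    C = np.linalg.norm(H1[:, None, :] - H2[None, :, :], axis=2)
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            D[i, j] = C[i - 1, j - 1] + min(D[i - 1, j - 1],
                                            D[i - 1, j],
                                            D[i, j - 1])
    path, i, j = [], T1, T2          # backtrack the optimal path
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```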


Self-Similarity Matrices (SSM)

The temporal self-similarity matrix is defined as $M = (m_{i,j})_{T \times T}$ with elements $m_{i,j} = k(x_i, x_j)$, where $m_{i,j}$ denotes the similarity of observations $x_i, x_j$ at times $i, j$ according to some kernel function $k$.
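To make the definition concrete, here is a minimal sketch of how such a matrix could be computed; the function name and feature layout are ours, and note that the poster's kernels below are in fact distances (small values mean similar frames).

```python
import numpy as np

def self_similarity_matrix(features, kernel=None):
    """Build the T x T temporal self-similarity matrix M = (m_ij) with
    m_ij = k(x_i, x_j) for a sequence of per-frame features.

    features: length-T sequence of per-frame feature arrays.
    kernel:   pairwise function k; defaults to the Euclidean distance
              used by the HoG and optical-flow variants below.
    """
    if kernel is None:
        kernel = lambda a, b: np.linalg.norm(a - b)
    T = len(features)
    M = np.zeros((T, T))
    for i in range(T):
        for j in range(i + 1, T):    # M is symmetric with zero diagonal
            M[i, j] = M[j, i] = kernel(features[i], features[j])
    return M
```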

Trajectory-based SSM
• $k(x_i, x_j) = \sum_p \|x_i^p - x_j^p\|_2$.
• $x_i^p$: 2D position of the point on track $p$ at time $i$.
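This kernel plugs directly into self_similarity_matrix from the sketch above; the array shapes are illustrative assumptions.

```python
import numpy as np

def trajectory_kernel(xi, xj):
    """Trajectory-based SSM kernel: sum over tracked points p of the
    Euclidean distance between their 2D positions at times i and j.

    xi, xj: (P, 2) arrays of point positions (e.g. P = 13 MoCap joints).
    """
    return np.linalg.norm(xi - xj, axis=1).sum()

# tracks: (T, P, 2) array of 2D joint tracks for one video
# M = self_similarity_matrix(tracks, kernel=trajectory_kernel)
```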

[Figure: trajectory-based SSMs of a golf swing from the side view and the top view; corresponding points A, B, C are marked in both SSMs.]

[Figure: trajectory-based SSMs (time × time) of the "bend" and "kick" actions for two actors, from the side and top views.]

HoG-based SSM
• $k(x_i, x_j) = \|x_i - x_j\|_2$.
• $x_i$: person HoG descriptor [Dalal & Triggs].
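A possible per-frame pipeline, using scikit-image's hog as a stand-in for the original Dalal-Triggs implementation; all parameters and the fixed-size crops are illustrative assumptions.

```python
import numpy as np
from skimage.feature import hog  # stand-in for the Dalal-Triggs descriptor

def hog_sequence(person_crops):
    """One HoG descriptor per frame, computed on fixed-size grayscale
    crops of the person bounding box."""
    return np.array([hog(crop, orientations=9,
                         pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2))
                     for crop in person_crops])

# M = self_similarity_matrix(hog_sequence(crops))  # default Euclidean kernel
```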

[Figure: HoG-based SSMs of the "bend" action for four actors.]

Optical-flow-based SSM
• $k(x_i, x_j) = \|x_i - x_j\|_2$.
• $x_i$: optical-flow components $v_x, v_y$ computed in the person bounding box and concatenated into a vector.
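One way to build these per-frame flow vectors, using OpenCV's Farnebäck flow as a stand-in (the poster does not name a flow method); crops must share a fixed size so all vectors have equal length.

```python
import cv2
import numpy as np

def flow_sequence(gray_crops):
    """Per-frame optical-flow feature: v_x and v_y inside the person
    bounding box, flattened into one vector per frame."""
    feats = []
    for prev, nxt in zip(gray_crops[:-1], gray_crops[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        feats.append(flow.reshape(-1))   # concatenate v_x and v_y maps
    return np.array(feats)

# M = self_similarity_matrix(flow_sequence(crops))
```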

[Figure: optical-flow-based SSMs of the "bend" action for four actors.]

Objective
• Human action recognition under view changes.

Related work
• Volumetric 3D reconstruction [Weinland et al. CVIU'06].
• Body-part trajectories and projective geometry [Yilmaz and Shah ICCV'05], [Parameswaran and Chellappa IJCV'06].
• View-stable 2D trajectory features [Rao et al. IJCV'02].
• Projective geometry without point correspondence [Wolf and Zomet IJCV'06].

Problems
• 2D/3D posture recovery is a difficult and generally unsolved problem.
• Direct extension of multiple-view geometry methods to human actions is difficult due to the hard cross-view correspondence problem.
• A pure learning approach is difficult due to the limited number of action samples in different views.

Hypothesis
• View invariance for non-rigid motion might be an easier problem than for static scenes, thanks to the additional time dimension.

This paper
• Cross-view action recognition under weak assumptions:
  – Only one test view
  – Different training view(s)
  – No 2D/3D reconstruction
  – No multi-view point correspondence
  – Assuming bounding-box person localization
