Imran Junejo, Emilie Dexter, Ivan Laptev, and Patrick...
Transcript of Imran Junejo, Emilie Dexter, Ivan Laptev, and Patrick...
Action Recognition Results
CMU MoCap dataset (multi-view)•Projected tracks for 13 joints, 12 action classes• Simulated noise in joint tracking; six virtual cam-
erastest views
cam5
cam4cam3
cam2
cam1
cam6
trai
ning
view
s
92.1 89.0 76.2 71.3 73.2 84.8 81.1
87.2 92.7 83.5 72.6 64.6 78.7 79.9
78.7 83.5 89.0 90.9 67.7 61.0 78.5
78.0 75.6 88.4 90.9 72.6 63.4 78.2
81.1 73.8 76.8 83.5 95.7 80.5 81.9
86.0 88.4 73.8 76.2 78.0 91.5 82.3
90.9 90.2 87.8 90.9 92.7 90.9 90.5
cam
1ca
m2
cam
3ca
m4
cam
5ca
m6
All
cam
1
cam
2
cam
3
cam
4
cam
5
cam
6
All
80.6 95.2 0.0 96.8 29.2 100.0 97.2 64.6 96.3 100.0 99.6 68.8 90.5
bend
cartw
heels
drink
fjum
pfly
strok
e
golf
jjack
jump
kick
run
walk walktu
rn
All
Weizman dataset (single view)• 9 classes of actions, performed by 9 actors.•NNC recognition accuracy 95.3% with SSM-pos and
94.6% with SSM-of-ofx-ofy-hog, compared to 92.6% [Aliet al. ICCV’07]
IXMAS dataset (multi-view)•Dataset: 5 different views, 11 action classes, per-
formed 3 times by 10 actors.
camera 1 camera 2
camera 3camera 4
camera 5camera 1
camera 2
camera 3
camera 4
camera 5
“check watch” action “punch action” action
View Invariance Properties
• SSM “images” are stable under view changes
Experimental vali-dation with denseview sampling
•Compute gradient orientation at each SSM point•Estimate per-point orientation variance over views
“bend” action “kick” action
10 20 30 40 50 60 70 80 90 100 110
10
20
30
40
50
60
70
80
90
100
110
0
50
100
150
average std: 17.4◦ average std: 20.93◦
SSM-based Descriptor
•Local patch-based descriptorcentered at each point on thediagonal•Compute an 8-bin histogram
of SSM gradients for each ofthe 11 blocks hm
i
Action recognition: we represent each videoas a bag of local SSM descriptors H = (h1, ..., hn). Ap-ply either Nearest Neighbor Classifier (BoF-NNC) orSupport Vector Machine (BoF-SVM).
Temporal Multi-View Video Alignment
•Represent each SSM by sequences of local SSM de-scriptors H1 and H2.•Align H1 and H2 with Dynamic Time Warping.
50 100 150 200 250
50
100
150
200
250
Self-Similarity Matrices (SSM)
Temporal self-similarity matrix is defined as
M = (mi,j)T×T with elements mi,j = k(xi, xj).
mi,j denotes the similarity of observations xi, xj attimes i, j according to some kernel function k.
Trajectory-based SSM• k(xi, xj) =
∑p ||xp
i − xpj||2.
• xpi : 2D point position on track p at time i.
golf: side view golf: top viewA
B
C
time
time
B
A
C
time
time
“bend” action “kick” action
side
view
Time
Tim
e
5 10 15 20 25 30 35
5
10
15
20
25
30
35
Time
Tim
e
5 10 15 20 25 30 35
5
10
15
20
25
30
35
Time
Tim
e
10 20 30 40 50 60
10
20
30
40
50
60
Time
Tim
e
5 10 15 20 25 30 35 40 45 50 55
5
10
15
20
25
30
35
40
45
50
55
top
view
Time
Tim
e
5 10 15 20 25 30 35 40 45
5
10
15
20
25
30
35
40
45
Time
Tim
e
5 10 15 20 25 30 35
5
10
15
20
25
30
35
Time
Tim
e
10 20 30 40 50 60
10
20
30
40
50
60
Time
Tim
e
5 10 15 20 25 30 35 40 45 50 55
5
10
15
20
25
30
35
40
45
50
55
Actor 1 Actor 2 Actor 1 Actor 2
HoG-based SSM• k(xi, xj) = ||xi − xj||2.• xi: [Dalal&Triggs] person HoG descriptor.
“ben
d”
Time
Tim
e
10 20 30 40 50 60
10
20
30
40
50
60
Time
Tim
e
10 20 30 40 50 60 70 80
10
20
30
40
50
60
70
80
Time
Tim
e
10 20 30 40 50 60 70 80
10
20
30
40
50
60
70
80
Time
Tim
e
10 20 30 40 50 60 70 80 90
10
20
30
40
50
60
70
80
90
Actor 1 Actor 2 Actor 3 Actor 4
Optical Flow based SSM• k(xi, xj) = ||xi − xj||2.• xi: OF components {vx, vy|vx|vy} computed in per-
son bounding box and concatenated into a vector.
“ben
d”
Time
Tim
e
10 20 30 40 50 60
10
20
30
40
50
60
Time
Tim
e
10 20 30 40 50 60 70 80
10
20
30
40
50
60
70
80
Time
Tim
e
10 20 30 40 50 60 70 80
10
20
30
40
50
60
70
80
Time
Tim
e
10 20 30 40 50 60 70 80 90
10
20
30
40
50
60
70
80
90
Actor 1 Actor 2 Actor 3 Actor 4
Objective•Human actions recognition under view changes.
Related work•Volumetric 3D reconstruction [Weinland et al. CVIU’06]•Body part trajectories, projective geometry [Yilmaz
and Shah ICCV’05], [Parameswaran and ChellapaIJCV’06]•View-stable 2D trajectory features [Rao et al. IJCV’02].•Projective geometry, no point correspondence [Wolf
and Zomet IJCV’06].
Problems• 2D/3D posture recovery is a difficult and gener-
ally unsolved problem.•Direct extension of multiple view geometry meth-
ods to human actions is difficult due to the hardcross-view correspondence problem.•Pure learning approach is difficult due to the lim-
ited number of action samples in different views.
Hypothesis•View-invariance for non-rigid motion might be an
easier problem compared to static scenes due tothe additional time dimension.
This paper•Cross-view action recognition under weak assump-
tions:– Only one test view– Different training view(s)– No 2D/3D reconstruction– No multi-view point correspondence– Assuming bounding box person localization
CROSS-VIEW ACTION RECOGNITION FROM TEMPORAL SELF-SIMILARITIES
Imran Junejo, Emilie Dexter, Ivan Laptev, and Patrick Perez
IRISA / INRIA Rennes, Campus universitaire de Beaulieu35042 Rennes Cedex France