[Figure: example snapshots of action categories — Reading, Walking, Riding horse, Phoning, Running, Playing instrument, Taking photo, Using computer, Riding bike.]
Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
Chao-Yeh Chen and Kristen Grauman, The University of Texas at Austin
Problem
Goal: learn actions from static images.
Related Work
- Learn actions with discriminative pose and appearance features, e.g. [Maji et al. 2011, Yang et al. 2010, Yao et al. 2010, Delaitre et al. 2011].
- Expand training data by mirroring images and videos, e.g. [Papageorgiou et al. 2000, Wang et al. 2009].
- Synthesize images for action recognition and pose estimation, e.g. [Matikainen et al. 2011, Shakhnarovich et al. 2003, Grauman et al. 2003, Shotton et al. 2011].
- Ours: expand the training set for "free" via pose dynamics learned from unlabeled data.
Approach
Expand snapshots by pose dynamics learned from videos.

Problems with static snapshots:
- May have only a few training examples for some actions.
- Often limited to "canonical" instances of the action.
Datasets:
- Hollywood Human Actions (unlabeled)
- PASCAL VOC 2010 (11 verbs)
- Stanford 40 Actions (10 selected verbs)
Conclusions
- Augment training data without additional labeling cost by leveraging unlabeled video.
- Simple but effective exemplar/manifold extrapolation strategies.
- Significant advantage when labeled training examples are sparse.
- Domain adaptation connects real and generated examples.
[Figure: a few canonical snapshots + an unlabeled video pool (generic human action) → synthesize pose examples; given a training snapshot, infer the pose before and the pose after.]
[Figure: detected poselets P1, P2, P3, … encoded as a poselet activation vector (PAV).]
Assumptions:
- Videos cover the space of human pose dynamics.
- No action labels are given.
- People are detectable and trackable.
[Figure: find the matched frame for a training snapshot, then generate nearby synthetic poses.]
1) Example-based strategy
2) Manifold-based strategy
Representing body pose
- Poselet activation vector (PAV) by [Maji et al. 2011].
- Each poselet captures part of the pose from a given viewpoint.
- Robust to occlusion and cluttered backgrounds.
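As a rough illustration (not necessarily the exact procedure of [Maji et al. 2011]), a PAV can be formed by recording, for each poselet type, the strongest detection score that fired inside the person's bounding box; the `detections` data below is hypothetical.

```python
# Illustrative sketch of building a poselet activation vector (PAV).
# Assumes hypothetical (poselet_id, score) detections inside the person's
# bounding box; not the exact pipeline of Maji et al. 2011.

def poselet_activation_vector(detections, num_poselets):
    """Return a vector whose k-th entry is the max score of poselet k."""
    pav = [0.0] * num_poselets
    for poselet_id, score in detections:
        pav[poselet_id] = max(pav[poselet_id], score)
    return pav

detections = [(0, 0.9), (2, 0.4), (0, 0.5), (3, 0.7)]  # hypothetical scores
print(poselet_activation_vector(detections, 5))
# -> [0.9, 0.0, 0.4, 0.7, 0.0]
```

Occlusion robustness falls out naturally: poselets that never fire simply contribute a zero entry rather than corrupting the rest of the vector.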
[Figure: match the training image to a frame at time t in the unlabeled video pool; frames at t−T and t+T provide the "before" and "after" poses.]
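The example-based strategy can be sketched as follows: match the training snapshot to its nearest frame in PAV space, then take the frames T steps before and after as synthetic poses. The toy `video` pool and the exact matching criterion below are illustrative assumptions.

```python
# Illustrative sketch of the example-based strategy: nearest-neighbor match
# in PAV space, then extrapolate to frames T steps before and after.
# The video pool below is hypothetical toy data.

def l2_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def synthesize_before_after(snapshot_pav, video_pavs, T):
    """Return (before, after) PAVs around the best-matching frame."""
    t = min(range(len(video_pavs)),
            key=lambda i: l2_dist(snapshot_pav, video_pavs[i]))
    before = video_pavs[max(t - T, 0)]               # clamp at clip start
    after = video_pavs[min(t + T, len(video_pavs) - 1)]  # clamp at clip end
    return before, after

video = [[0.0, 1.0], [0.2, 0.8], [0.5, 0.5], [0.8, 0.2], [1.0, 0.0]]
before, after = synthesize_before_after([0.45, 0.55], video, T=1)
print(before, after)  # PAVs adjacent to the matched middle frame
```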
Locally Linear Embedding [Roweis and Saul 2000]
A similarity function combines temporal nearness and pose similarity.
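The poster's exact similarity function is not reproduced here; one plausible form (our assumption, not the authors' formula) multiplies a pose-similarity kernel by a temporal-nearness kernel, and such a similarity could define the neighborhood graph for the embedding:

```python
import math

# Hypothetical similarity combining temporal nearness and pose similarity.
# sigma and tau are illustrative bandwidths, not values from the poster.
def similarity(pav_i, t_i, pav_j, t_j, sigma=1.0, tau=5.0):
    pose_d2 = sum((a - b) ** 2 for a, b in zip(pav_i, pav_j))
    pose_term = math.exp(-pose_d2 / (2 * sigma ** 2))  # pose similarity
    time_term = math.exp(-abs(t_i - t_j) / tau)        # temporal nearness
    return pose_term * time_term
```

Frames that are both close in time and close in pose score near 1; either a large time gap or a large pose change drives the similarity toward 0.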
[Figure: pose feature space with frames from video (source), static snapshots (target), and real labeled (+/−) examples.]
Overview
Few static snapshots + unlabeled video data → synthesize pose examples → train with real & synthetic poses.
[Figure: classifier in pose feature space trained with real (+/−) and synthetic pose examples.]
Domain adaptation by [Daume III 2007]
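[Daume III 2007]'s "frustratingly easy" feature augmentation maps source features to (x, x, 0) and target features to (x, 0, x), so one classifier can exploit what the domains share while keeping domain-specific components. A minimal sketch:

```python
# "Frustratingly easy" domain adaptation [Daume III 2007]: replicate each
# feature vector into shared + domain-specific blocks by concatenation.
def augment(x, domain):
    """Source x -> (x, x, 0); target x -> (x, 0, x)."""
    zeros = [0.0] * len(x)
    if domain == "source":       # e.g. frames from video
        return x + x + zeros
    else:                        # e.g. static snapshots
        return x + zeros + x

print(augment([1.0, 2.0], "source"))  # -> [1.0, 2.0, 1.0, 2.0, 0.0, 0.0]
print(augment([1.0, 2.0], "target"))  # -> [1.0, 2.0, 0.0, 0.0, 1.0, 2.0]
```

Any standard linear classifier trained on the augmented features then effectively learns a shared weight vector plus per-domain corrections.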
Recognizing activity in images

[Figure: accuracy (mAP) on PASCAL (# train = 15) and Stanford 10 (# train = 17), and accuracy (mAP) as the number of training images grows from 15 to 50 (PASCAL) and 17 to 50 (Stanford 10).]
[Figure: our accuracy gain vs. pose diversity among the training images.]
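The poster does not spell out how pose diversity is measured; one illustrative proxy (our assumption) is the mean pairwise Euclidean distance between the training images' PAVs:

```python
# Illustrative proxy (not the poster's definition) for pose diversity:
# mean pairwise Euclidean distance between training PAVs.
def pose_diversity(pavs):
    n = len(pavs)
    if n < 2:
        return 0.0
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += sum((a - b) ** 2
                         for a, b in zip(pavs[i], pavs[j])) ** 0.5
    return total / (n * (n - 1) / 2)

print(pose_diversity([[0.0, 0.0], [3.0, 4.0]]))  # -> 5.0
```

Under any such measure, a training set of near-identical canonical poses scores low, which is exactly the regime where synthesized "before/after" poses add the most.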
Most benefit when training images lack pose diversity.

Synthetic feature examples
[Figure: real training snapshots alongside synthetic "before" and "after" pose examples.]
Recognizing activity in videos
- Substantial gains when recognizing actions in video, since our method infers intermediate poses not covered in the original snapshots.
Our idea
Let the system:
- Watch videos to learn how human poses change over time.
- Infer nearby poses to expand the sparse training snapshots.
Accuracy of video activity recognition on 78 test videos from the HMDB51, ASLAN, and UCF datasets.