Modeling Scene and Object Contexts for Human Action Retrieval with Few Examples Yu-Gang Jiang...
Date posted: 19-Dec-2015
Modeling Scene and Object Contexts for Human Action
Retrieval with Few Examples
Yu-Gang Jiang, Zhenguo Li, Shih-Fu Chang
IEEE Transactions on CSVT, 2011
Framework
A. Video Representation and Negative Sample Selection
B. Obtaining Action Context
   1. Scene Recognition
   2. Object Recognition
C. Estimating Action-Scene-Object Relationship
D. Incorporating Multiple Contextual Cues
A. Video Representation and Negative Sample Selection
• Use the bag-of-features framework
• Use k-means clustering to generate 4000 visual words
• Quantize each video clip into two 4000-D histograms of visual words
• Apply Local and Global Consistency (LGC) [27]
• Pick negative samples after propagation
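The vocabulary and quantization steps above can be sketched in a few lines of plain Python. This is a toy illustration, not the authors' pipeline: the 2-D descriptors, the vocabulary of 3 words (instead of 4000), the L1 normalization of the histogram, and the helper names `nearest`, `kmeans`, and `bof_histogram` are all assumptions made for the example.

```python
import random

def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centers[k])))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns k centers (the visual words)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the old center if a cluster goes empty
                centers[j] = [sum(col) / len(g) for col in zip(*g)]
    return centers

def bof_histogram(descriptors, centers):
    """Quantize one clip's descriptors into a word histogram (L1-normalized, an assumption)."""
    hist = [0.0] * len(centers)
    for d in descriptors:
        hist[nearest(d, centers)] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]

# Toy data: 2-D descriptors drawn around three hypothetical centers.
rng = random.Random(1)
train = [[rng.gauss(cx, 0.1), rng.gauss(cy, 0.1)]
         for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(30)]
vocab = kmeans(train, k=3)

# One "clip" with four local descriptors becomes one 3-D histogram.
clip = [[0.05, -0.02], [4.9, 5.1], [5.2, 4.8], [0.1, 5.0]]
hist = bof_histogram(clip, vocab)
```

With two feature types, the same procedure would be run once per feature type, giving two histograms per clip.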
[27] D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. Scholkopf, “Learning with local and global consistency,” in Proc. Neural Inform. Process. Syst., 2004, pp. 321–328.
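A minimal sketch of how LGC propagation could be used to pick negatives, assuming a Gaussian affinity graph over the clip features and assuming that the unlabeled clips with the lowest propagated scores are taken as negatives. The slides do not state the selection rule, so that rule, the parameters `alpha` and `sigma`, and the toy data are hypothetical.

```python
import math

def lgc_scores(X, y, alpha=0.9, sigma=1.0, iters=200):
    """Propagate initial labels y over a Gaussian-affinity graph on X (LGC iteration)."""
    n = len(X)
    # Affinity W_ij = exp(-||xi - xj||^2 / (2 sigma^2)), with a zero diagonal.
    W = [[0.0 if i == j else
          math.exp(-sum((a - b) ** 2 for a, b in zip(X[i], X[j])) / (2 * sigma ** 2))
          for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in W]
    # Symmetric normalization S = D^{-1/2} W D^{-1/2}.
    S = [[W[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    f = list(y)
    for _ in range(iters):
        # F(t+1) = alpha * S * F(t) + (1 - alpha) * Y
        f = [alpha * sum(S[i][j] * f[j] for j in range(n)) + (1 - alpha) * y[i]
             for i in range(n)]
    return f

# Toy data: sample 0 is the only labeled positive; the rest are unlabeled.
X = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
scores = lgc_scores(X, y)

# Assumed rule: the unlabeled samples with the lowest scores become negatives.
unlabeled = [i for i, label in enumerate(y) if label == 0.0]
negatives = sorted(unlabeled, key=lambda i: scores[i])[:2]
```

In this toy case the selected negatives come from the cluster far from the labeled positive, which is the behavior one would want from negative sample selection with few examples.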
B. Scene Recognition
• Train a separate classifier for each of the two bag-of-features representations and average their probability predictions
• The scene models are learned with SVMs
• Adopt 10 scene classes: House, Road, Bedroom, Car Interior, Hotel, Kitchen, Living Room, Office, Restaurant, Shop
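The averaging step can be illustrated as follows. The probability values are made up, and `average_fusion` is a hypothetical helper name; in practice the two inputs would be the per-class probability outputs of the two SVM-based scene classifiers.

```python
def average_fusion(prob_a, prob_b):
    """Late fusion: average two classifiers' per-class probabilities."""
    return {cls: 0.5 * (prob_a[cls] + prob_b[cls]) for cls in prob_a}

# Hypothetical outputs of the two per-feature scene classifiers for one clip.
p1 = {"Kitchen": 0.6, "Office": 0.3, "Road": 0.1}
p2 = {"Kitchen": 0.2, "Office": 0.7, "Road": 0.1}

fused = average_fusion(p1, p2)
best = max(fused, key=fused.get)  # scene with the highest averaged probability
```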
B. Object Recognition
• The object detector covers only three classes: person, chair, and car
• Define actions
– Track objects based on location and box size
– Discard isolated detections
• Compute the average spatial distance between different types of objects
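One way the tracking and distance steps might look, sketched under several assumptions: detections are given as `(label, cx, cy, w, h)` tuples per frame, tracks are linked greedily between consecutive frames, and the thresholds `max_shift`, `max_scale`, and `min_len` are hypothetical values chosen only for the example.

```python
import math
from collections import defaultdict

def link_tracks(frames, max_shift=30.0, max_scale=1.5, min_len=2):
    """Greedily link per-frame detections into tracks; drop isolated ones.

    frames: one list per frame of detections (label, cx, cy, w, h).
    Returns tracks as lists of (frame_index, detection).
    """
    tracks = []
    for t, dets in enumerate(frames):
        for det in dets:
            label, cx, cy, w, h = det
            match = None
            for tr in tracks:
                pt, (pl, px, py, pw, ph) = tr[-1]
                close = math.hypot(cx - px, cy - py) <= max_shift   # location
                ratio = (w * h) / (pw * ph)                         # box size
                similar = 1 / max_scale <= ratio <= max_scale
                if pt == t - 1 and pl == label and close and similar:
                    match = tr
                    break
            if match is None:
                tracks.append([(t, det)])
            else:
                match.append((t, det))
    return [tr for tr in tracks if len(tr) >= min_len]

def mean_class_distance(tracks):
    """Average center distance between co-occurring detections of different classes."""
    by_frame = defaultdict(list)
    for tr in tracks:
        for t, det in tr:
            by_frame[t].append(det)
    sums, counts = defaultdict(float), defaultdict(int)
    for dets in by_frame.values():
        for i in range(len(dets)):
            for j in range(i + 1, len(dets)):
                a, b = dets[i], dets[j]
                if a[0] != b[0]:
                    pair = tuple(sorted((a[0], b[0])))
                    sums[pair] += math.hypot(a[1] - b[1], a[2] - b[2])
                    counts[pair] += 1
    return {pair: sums[pair] / counts[pair] for pair in sums}

# Toy clip: a person and a chair persist; the car appears in one frame only.
frames = [
    [("person", 100, 100, 40, 80), ("chair", 160, 180, 50, 50)],
    [("person", 105, 100, 40, 80), ("chair", 160, 180, 50, 50), ("car", 400, 50, 90, 60)],
    [("person", 110, 100, 42, 80), ("chair", 160, 180, 50, 50)],
]
tracks = link_tracks(frames)
distances = mean_class_distance(tracks)
```

Here the single-frame car detection is discarded as isolated, so only the person-chair distance is reported.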
C. Estimating Action-Scene-Object Relationship
• Define a context-based inference score that
– Distinguishes well between samples from the positive set P and the negative set N
– Produces similar scores when two samples are close
• F: prediction matrix of the contextual cues (n training samples × m contextual cues)
• c: coefficient vector (one coefficient per contextual cue)
• The diagram shows the product F × c
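To make the two requirements on the inference score concrete, the sketch below assumes the score of each training sample is the corresponding entry of the product F c, as the diagram suggests, and measures the two properties with simple hypothetical quantities: a mean-score gap between P and N, and a graph roughness term over a similarity matrix `W`. It does not show how c is learned, which the slides do not spell out; the values of `F`, `W`, and `c` are invented.

```python
def context_scores(F, c):
    """Context-based inference score per sample: the matrix-vector product F c."""
    return [sum(f_ij * c_j for f_ij, c_j in zip(row, c)) for row in F]

def separation(scores, P, N):
    """Mean score over positives minus mean score over negatives (higher is better)."""
    return sum(scores[i] for i in P) / len(P) - sum(scores[i] for i in N) / len(N)

def roughness(scores, W):
    """Sum of w_ij (s_i - s_j)^2; small when close samples get similar scores."""
    n = len(scores)
    return sum(W[i][j] * (scores[i] - scores[j]) ** 2
               for i in range(n) for j in range(i + 1, n))

# Toy problem: n = 4 training samples, m = 2 contextual cues.
F = [[0.9, 0.2], [0.8, 0.1], [0.2, 0.7], [0.1, 0.9]]
P, N = [0, 1], [2, 3]
# Hypothetical similarity graph: samples 0-1 are close, and samples 2-3 are close.
W = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
c = [1.0, -0.5]

s = context_scores(F, c)
gap = separation(s, P, N)
rough = roughness(s, W)
```

A good coefficient vector would give a large `gap` and a small `rough` at the same time.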
D. Incorporating Multiple Contextual Cues
• Given an action a and a test sample x, the refinement involves:
– A context weight parameter
– The prediction score of the contextual cues on x
– The action prediction score based on raw visual features
– The refined prediction after incorporating contextual cues
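The combination formula itself did not survive in this transcript. As a purely illustrative sketch, the code below assumes an additive form in which the refined prediction is the raw visual score plus the context score scaled by the weight parameter; the function name `refine`, the additive form, and the numbers are all assumptions.

```python
def refine(raw_score, context_score, weight):
    """Assumed additive fusion: raw visual score plus weighted context score."""
    return raw_score + weight * context_score

# Hypothetical scores for two test clips and one action.
clips = {
    "clip_a": {"raw": 0.40, "context": 0.75},
    "clip_b": {"raw": 0.45, "context": -0.20},
}
weight = 0.2  # hypothetical context weight

refined = {name: refine(v["raw"], v["context"], weight) for name, v in clips.items()}
ranking = sorted(refined, key=refined.get, reverse=True)
```

Under this assumed form, context can reorder a retrieval list: `clip_b` has the higher raw score, but `clip_a` ranks first after refinement.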
Action classes: AnswerPhone, DriveCar, Eat, Kiss, GetOutCar, HandShake, FightPerson, HugPerson, Run, SitDown, SitUp, StandUp