Modeling Scene and Object Contexts for Human Action Retrieval with Few Examples Yu-Gang Jiang...

Modeling Scene and Object Contexts for Human Action Retrieval with Few Examples

Yu-Gang Jiang, Zhenguo Li, Shih-Fu Chang

IEEE Transactions on CSVT 2011

Outline

• Context-based Action Retrieval Framework
• Experiment Results
• Conclusion

Framework

A. Video Representation and Negative Sample Selection

B. Obtaining Action Context
1. Scene Recognition
2. Object Recognition

C. Estimating Action-Scene-Object Relationship

D. Incorporating Multiple Contextual Cues

Context-Based Action Retrieval Framework


• Use the bag-of-features framework
• Use k-means clustering to generate 4000 visual words
• Quantize each video clip into two 4000-D histograms of visual words
• Apply Local and Global Consistency (LGC) [27]
• Pick negative samples after propagation
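The quantization step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the codebook of visual words has already been produced by k-means, uses squared Euclidean distance for the nearest-word lookup, and L1-normalizes the histogram. The names `nearest_word` and `bag_of_features` are hypothetical, and the toy codebook has 2 words rather than 4000.

```python
def nearest_word(descriptor, codebook):
    """Index of the visual word closest to a descriptor (squared Euclidean distance)."""
    def sq_dist(word):
        return sum((a - b) ** 2 for a, b in zip(descriptor, word))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))


def bag_of_features(descriptors, codebook):
    """Quantize local descriptors into a normalized histogram of visual words."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        hist[nearest_word(d, codebook)] += 1.0
    total = sum(hist)
    # L1 normalization is an assumption; an empty clip keeps an all-zero histogram.
    return [h / total for h in hist] if total else hist


# Toy example: a 2-word codebook standing in for the 4000 visual words.
codebook = [[0.0, 0.0], [10.0, 10.0]]
clip_descriptors = [[1.0, 0.0], [9.0, 9.0], [0.0, 1.0], [0.5, 0.5]]
histogram = bag_of_features(clip_descriptors, codebook)
```

Each clip would be passed through this step twice, once per histogram mentioned on the slide.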

[27] D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. Scholkopf, “Learning with local and global consistency,” in Proc. Neural Inform. Process. Syst., 2004, pp. 321–328.
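A rough sketch of LGC-style propagation followed by negative selection is given below. It uses the iterative update from [27], F ← αSF + (1 − α)Y with S = D^(-1/2) W D^(-1/2). The affinity matrix `W`, the value of `alpha`, the iteration count, and the rule "take the unlabeled samples with the lowest propagated scores as negatives" are all assumptions made for illustration; the slides do not specify them.

```python
import math


def lgc_propagate(W, y, alpha=0.9, iters=200):
    """Iterate f <- alpha * S f + (1 - alpha) * y, with S = D^-1/2 W D^-1/2."""
    n = len(W)
    degree = [sum(row) for row in W]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degree]
    S = [[W[i][j] * inv_sqrt[i] * inv_sqrt[j] for j in range(n)] for i in range(n)]
    f = list(y)
    for _ in range(iters):
        f = [alpha * sum(S[i][j] * f[j] for j in range(n)) + (1 - alpha) * y[i]
             for i in range(n)]
    return f


def pick_negatives(scores, labeled, k):
    """Indices of the k unlabeled samples with the lowest propagated scores (assumed rule)."""
    candidates = [i for i in range(len(scores)) if i not in labeled]
    return sorted(candidates, key=lambda i: scores[i])[:k]


# Toy chain graph: samples 0-1 and 2-3 are strongly linked, 1-2 weakly linked.
W = [[0.0, 1.0, 0.0, 0.0],
     [1.0, 0.0, 0.1, 0.0],
     [0.0, 0.1, 0.0, 1.0],
     [0.0, 0.0, 1.0, 0.0]]
y = [1.0, 0.0, 0.0, 0.0]          # sample 0 is the only labeled positive
scores = lgc_propagate(W, y)
negatives = pick_negatives(scores, labeled={0}, k=2)
```

In this toy graph the samples farthest from the positive example receive the lowest scores, so they are the ones selected as negatives.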

B. Scene Recognition

• Train a separate classifier for each of the two bag-of-features representations and average their probability predictions

• The scene models are learned by SVM
• Adopt 10 scene classes

House Road Bedroom Car Interior Hotel

Kitchen Living Room Office Restaurant Shop
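The averaging step can be illustrated with a small sketch. It assumes each classifier already outputs a probability per scene class; the SVM training itself is omitted, and the function name `fuse_scene_predictions` and the example probabilities are hypothetical.

```python
def fuse_scene_predictions(per_classifier_probs):
    """Average the per-scene probabilities produced by several classifiers."""
    n = len(per_classifier_probs)
    scenes = per_classifier_probs[0].keys()
    return {s: sum(p[s] for p in per_classifier_probs) / n for s in scenes}


# Made-up outputs of the two classifiers for one clip (two of the ten scenes shown).
probs_first = {"Kitchen": 0.75, "Office": 0.25}
probs_second = {"Kitchen": 0.25, "Office": 0.75}
fused = fuse_scene_predictions([probs_first, probs_second])
```

Plain averaging needs no extra training data, which fits a setting with few examples.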

B. Object Recognition

• The object detector can only detect person, chair, and car
• Define actions
– Track objects based on location and box size
– Discard isolated detections
• Compute the average spatial distance between different types of objects
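One way to read the tracking and distance steps is sketched below. This is an assumed greedy scheme, not the paper's tracker: a detection joins a track if the previous frame holds a box with a nearby center and a similar area, tracks of length one are treated as isolated and dropped, and the distance between two object types is the mean center-to-center distance. The thresholds `max_shift` and `max_scale` and all function names are hypothetical.

```python
import math


def _compatible(prev, box, max_shift, max_scale):
    """Boxes are (cx, cy, w, h); compare center displacement and area ratio."""
    shift = math.hypot(prev[0] - box[0], prev[1] - box[1])
    area_prev, area_box = prev[2] * prev[3], box[2] * box[3]
    ratio = max(area_prev, area_box) / min(area_prev, area_box)
    return shift <= max_shift and ratio <= max_scale


def link_detections(frames, max_shift=20.0, max_scale=1.5):
    """Greedily link detections of one object type across consecutive frames."""
    tracks = []  # each track is a list of (frame_index, box)
    for t, boxes in enumerate(frames):
        for box in boxes:
            for track in tracks:
                last_t, last_box = track[-1]
                if last_t == t - 1 and _compatible(last_box, box, max_shift, max_scale):
                    track.append((t, box))
                    break
            else:
                tracks.append([(t, box)])
    # Discard isolated detections: tracks that were never extended.
    return [track for track in tracks if len(track) > 1]


def mean_distance(boxes_a, boxes_b):
    """Average center-to-center distance between two sets of boxes."""
    pairs = [(a, b) for a in boxes_a for b in boxes_b]
    if not pairs:
        return None
    return sum(math.hypot(a[0] - b[0], a[1] - b[1]) for a, b in pairs) / len(pairs)


# Three frames of "person" detections; the box at (100, 100) appears only once.
person_frames = [[(10, 10, 5, 5), (100, 100, 5, 5)], [(12, 11, 5, 5)], [(14, 12, 5, 6)]]
person_tracks = link_detections(person_frames)
```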

C. Estimating Action-Scene-Object Relationship


• Define a context-based inference score that should

– Distinguish samples in P from samples in N well

– Produce similar scores for two samples that are close to each other


• F : prediction matrix of contextual cues (n training samples × m contextual cues)
• c : coefficient vector
• Context-based inference scores: F × c

[Equation labels: Constraint 1, Constraint 2]
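For orientation, one common way to encode two requirements of this kind is a fitting term plus a graph-smoothness term. The expression below is an illustrative assumption, not the objective from the paper: the label vector y (for example +1 for samples in P and −1 for samples in N), the pairwise affinities w_ij, and the trade-off weight μ are all hypothetical symbols introduced here.

```latex
\min_{c}\;
\underbrace{\lVert F c - y \rVert_2^2}_{\text{Constraint 1: separate } P \text{ from } N}
\;+\;
\mu \underbrace{\sum_{i,j} w_{ij}\,\bigl( (Fc)_i - (Fc)_j \bigr)^2}_{\text{Constraint 2: close samples get similar scores}}
```

Under this reading, the first term pushes the scores Fc toward the labels, and the second penalizes score differences between samples with high affinity.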

D. Incorporating Multiple Contextual Cues

• Given an action a and a test sample x

• Quantities involved:
– context weight parameter
– the prediction score of contextual cues on x
– action prediction score based on raw visual features
– refined prediction after incorporating contextual cues
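How these quantities might interact can be shown with a short sketch. The additive form (raw score plus weight times context score) is an assumption, since the formula itself did not survive in the transcript; the function names and the example numbers are hypothetical.

```python
def refine_score(raw_score, context_score, context_weight):
    """Assumed additive refinement: raw visual score plus weighted context score."""
    return raw_score + context_weight * context_score


def rerank(raw_scores, context_scores, context_weight):
    """Indices of test samples sorted by refined score, best first."""
    refined = [refine_score(r, s, context_weight)
               for r, s in zip(raw_scores, context_scores)]
    return sorted(range(len(refined)), key=lambda i: refined[i], reverse=True)


# Two test samples for one action: context support promotes sample 0.
raw = [0.5, 0.6]
context = [0.4, -0.2]
ranking_with_context = rerank(raw, context, context_weight=0.5)
ranking_raw_only = rerank(raw, context, context_weight=0.0)
```

Setting the context weight to zero falls back to the ranking from raw visual features alone.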

AnswerPhone DriveCar Eat Kiss GetOutCar HandShake

FightPerson HugPerson Run SitDown SitUp StandUp

Experiment Results

• Mean average precision (mAP)
• Retrieval Performance by Raw Features
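For reference, mAP can be computed as in the sketch below. It assumes each action has a full ranked list of test clips marked relevant (1) or not (0), so that every relevant clip appears somewhere in the ranking; the function names and example rankings are hypothetical.

```python
def average_precision(ranked_relevance):
    """AP for one ranked list: mean of precision values at each relevant position."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """mAP: the mean of the per-action average precisions."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Two made-up actions: relevant clips at ranks 1 and 3, and at rank 2.
example_rankings = [[1, 0, 1], [0, 1]]
example_map = mean_average_precision(example_rankings)
```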

Experiment Results

• Scene vs. Object

Experiment Results

• Comparison to the state of the art
– SVM learning
– Movie script-mining

Conclusion

• An algorithm based on the semi-supervised learning paradigm is used to model the action-scene-object dependency from limited samples

• This algorithm can be applied to other types of action videos
