M. S. Ryoo and J. K. Aggarwal ICCV2009

Spatio-Temporal Relationship Match: Video Structure Comparison for Recognition of Complex Human Activities M. S. Ryoo and J. K. Aggarwal ICCV2009


Transcript of M. S. Ryoo and J. K. Aggarwal ICCV2009

Page 1: M. S.  Ryoo and J. K.  Aggarwal ICCV2009

Spatio-Temporal Relationship Match: Video Structure Comparison for Recognition of Complex Human Activities

M. S. Ryoo and J. K. Aggarwal, ICCV 2009

Page 2: M. S.  Ryoo and J. K.  Aggarwal ICCV2009

Introduction

• Human activity recognition, the automated detection of ongoing activities from video, is an important problem.

• This technology can be used in surveillance systems, robots, and human-computer interfaces.

• In surveillance systems, automatically detecting violent activities is very important.

Page 3: M. S.  Ryoo and J. K.  Aggarwal ICCV2009

Introduction

• Spatio-temporal feature-based approaches have been proposed by many researchers.

• These methods have been successful on short videos containing simple actions such as “walking” and “waving”.

• In real-world applications, actions and activities are seldom this simple.

Page 4: M. S.  Ryoo and J. K.  Aggarwal ICCV2009

Related works

• Methods that focus on tracking persons and bodies have been developed [4,11], but their results rely on background subtraction.

• Approaches that analyze a 3-D XYT volume have gained particular attention in the past few years [3,5,6,9,13,16]; they extract relationships among features and train a model.

Page 5: M. S.  Ryoo and J. K.  Aggarwal ICCV2009

• [3] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In IEEE International Workshop on VS-PETS, pages 65–72, 2005.

• [4] S. Hongeng, R. Nevatia, and F. Bremond. Video-based event recognition: activity representation and probabilistic recognition methods. CVIU, 96(2):129–162, 2004.

• [5] H. Jhuang, T. Serre, L. Wolf, and T. Poggio. A biologically inspired system for action recognition. In ICCV, 2007.

• [6] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.

• [9] J. C. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. IJCV, 79(3), Sep 2008.

• [11] M. S. Ryoo and J. K. Aggarwal. Semantic representation and recognition of continued and recursive human activities. IJCV, 82(1):1–24, April 2009.

• [13] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: a local SVM approach. In ICPR, 2004.

• [16] S.-F. Wong, T.-K. Kim, and R. Cipolla. Learning motion categories using both semantic and structural information. In CVPR, 2007.

Page 6: M. S.  Ryoo and J. K.  Aggarwal ICCV2009

Related works

• In this paper, we propose a new spatio-temporal feature-based methodology.

• Kernel functions are built on the relationships between features.

• After training on the features, a match function is used to match test data.

Page 7: M. S.  Ryoo and J. K.  Aggarwal ICCV2009

Example matching result

Page 8: M. S.  Ryoo and J. K.  Aggarwal ICCV2009

Spatio-temporal relationship match

• The method matches two videos and outputs a real number as the result.

• $K : V \times V \rightarrow \mathbb{R}$

• $V$: input videos; $\mathbb{R}$: real-valued result
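The slides give only the signature of K, not its form. As a minimal sketch of a function with that signature, the Python below assumes each video has already been reduced to a list of (feature type, (x, y, t)) pairs and compares two videos by histogram intersection over (type, type, relation) triples. The names `relation_histogram` and `match_kernel`, the restriction to the two relations "before" and "equals", and the choice of histogram intersection are all assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations


def relation_histogram(video):
    """Count (type_a, type_b, relation) triples over ordered feature pairs.

    `video` is a list of (feature_type, (x, y, t)) pairs. Only two
    illustrative temporal relations are used (an assumption).
    """
    hist = Counter()
    for (type_a, loc_a), (type_b, loc_b) in permutations(video, 2):
        if loc_a[2] < loc_b[2]:
            hist[(type_a, type_b, "before")] += 1
        elif loc_a[2] == loc_b[2]:
            hist[(type_a, type_b, "equals")] += 1
    return hist


def match_kernel(video_1, video_2):
    """Hypothetical K: V x V -> R, computed as histogram intersection."""
    h1 = relation_histogram(video_1)
    h2 = relation_histogram(video_2)
    return float(sum(min(count, h2[key]) for key, count in h1.items()))


# Toy videos: feature types are small integers, locations are (x, y, t).
video_a = [(0, (10, 20, 1)), (1, (40, 18, 2)), (1, (12, 22, 3))]
video_b = [(0, (5, 5, 4)), (1, (30, 9, 6))]   # type 0 before type 1
video_c = [(1, (5, 5, 4)), (0, (30, 9, 6))]   # type 1 before type 0

print(match_kernel(video_a, video_b))
print(match_kernel(video_a, video_c))
```

In this toy case `video_b` shares the "type 0 before type 1" structure with `video_a`, while `video_c` has the same feature types in the opposite temporal order, so a kernel built on relations can separate them even though a plain bag of feature types could not.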

Page 9: M. S.  Ryoo and J. K.  Aggarwal ICCV2009

Features and their relations

• A spatio-temporal feature extractor [3,14] detects interest points, each located at a salient change in the video.

Page 10: M. S.  Ryoo and J. K.  Aggarwal ICCV2009
Page 11: M. S.  Ryoo and J. K.  Aggarwal ICCV2009

Features and their relations

• $f = (f^{des}, f^{loc})$

• $f^{des}$: descriptor; $f^{loc}$: 3-D coordinate

• The features are clustered into k types using k-means on $f^{des}$.
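The clustering step can be sketched in plain Python as follows. This is an illustrative implementation only: the function names, the farthest-point initialization, the fixed iteration count, and the toy 2-D descriptors are assumptions, not details from the slides.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two descriptor vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(descriptors, k, iters=20):
    """Cluster descriptors into k types; returns (centroids, labels)."""
    # Deterministic farthest-point initialization (an assumption).
    centroids = [list(descriptors[0])]
    while len(centroids) < k:
        far = max(descriptors, key=lambda d: min(sqdist(d, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(descriptors)
    for _ in range(iters):
        # Assignment step: each descriptor goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: sqdist(d, centroids[j]))
                  for d in descriptors]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [d for d, lab in zip(descriptors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Hypothetical toy features: (descriptor, location) with (x, y, t) locations.
features = [
    ((0.0, 0.1), (10, 20, 1)),
    ((0.1, 0.0), (12, 22, 3)),
    ((5.0, 5.1), (40, 18, 2)),
    ((5.2, 4.9), (42, 19, 5)),
]
_, types = kmeans([des for des, _ in features], k=2)

# Replace each descriptor by its cluster index (its "type"), keep the location.
typed_features = [(t, loc) for t, (_, loc) in zip(types, features)]
print(typed_features)
```

After this step each feature is a (type, location) pair, which is the form a relation-based comparison between features would need.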

Page 12: M. S.  Ryoo and J. K.  Aggarwal ICCV2009

Features and their relations

• For n features, the locations are $f_1^{loc}, \ldots, f_n^{loc}$.

• Temporal relations between features are described by the following types:

Page 13: M. S.  Ryoo and J. K.  Aggarwal ICCV2009

Features and their relations

• Spatial relations are described below:

Page 14: M. S.  Ryoo and J. K.  Aggarwal ICCV2009
Page 15: M. S.  Ryoo and J. K.  Aggarwal ICCV2009

Features and their relations

Page 16: M. S.  Ryoo and J. K.  Aggarwal ICCV2009

Human activity recognition

• Our system maintains one training dataset $D_\alpha$ per activity $\alpha$.

• Let $D_\alpha^m$ be extracted from the m-th training video in the set $D_\alpha$; the matching function is then applied to compare it with the input video.
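One way to turn per-activity training sets and a matching function into a decision is sketched below. The decision rule (take the best match over an activity's training videos, then report the highest-scoring activity if it clears a threshold), the `threshold` parameter, and the placeholder `toy_kernel` are assumptions for illustration; the slides do not state the exact rule.

```python
def recognize(test_video, training_sets, kernel, threshold=0.0):
    """Return (activity, score), or (None, score) if no score exceeds threshold.

    `training_sets` maps an activity name to a non-empty list of training
    videos; `kernel` is any function mapping two videos to a real number.
    """
    best_activity, best_score = None, float("-inf")
    for activity, videos in training_sets.items():
        # Score an activity by its best-matching training video.
        score = max(kernel(test_video, v) for v in videos)
        if score > best_score:
            best_activity, best_score = activity, score
    if best_score <= threshold:
        return None, best_score
    return best_activity, best_score


def toy_kernel(video_1, video_2):
    """Placeholder kernel: number of shared feature types (not the slides' K)."""
    return float(len(set(video_1) & set(video_2)))


# Toy data: each video is just a list of feature types.
training_sets = {
    "push": [[0, 1, 2], [0, 2, 5]],
    "kick": [[3, 4], [3, 4, 6]],
}
label, score = recognize([3, 4, 6, 7], training_sets, toy_kernel)
print(label, score)
```

Passing the kernel in as an argument keeps the decision rule separate from the choice of matching function, so a relation-based kernel can be substituted for the placeholder without changing `recognize`.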

Page 17: M. S.  Ryoo and J. K.  Aggarwal ICCV2009

Localization

Page 18: M. S.  Ryoo and J. K.  Aggarwal ICCV2009

Hierarchical recognition

• Low-level actions can be combined into high-level actions.

• For instance, a hand-shake includes two sub-actions, “arm stretching” (st) and “arm withdrawing” (wd).

• A hand-shake may be detected as: st1 before wd1, st2 before wd2, st1 equals st2, wd1 equals wd2.
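The hand-shake rule above can be checked with a few lines of Python. This sketch assumes each detected sub-action is a (start_frame, end_frame) interval, that the indices 1 and 2 refer to the two persons, and that "equals" tolerates a small frame offset `tol`; all three are assumptions for illustration.

```python
def before(a, b):
    """Interval a ends strictly before interval b starts."""
    return a[1] < b[0]


def equals(a, b, tol=5):
    """Intervals a and b coincide up to `tol` frames (hypothetical tolerance)."""
    return abs(a[0] - b[0]) <= tol and abs(a[1] - b[1]) <= tol


def is_hand_shake(st1, wd1, st2, wd2):
    """Apply the four relations from the slide to the sub-action intervals."""
    return (before(st1, wd1) and before(st2, wd2)
            and equals(st1, st2) and equals(wd1, wd2))


# Both persons stretch at about the same time, then withdraw together.
shake = is_hand_shake(st1=(10, 30), wd1=(45, 60), st2=(12, 31), wd2=(44, 62))

# The second person stretches much later, so "st1 equals st2" fails.
no_shake = is_hand_shake(st1=(10, 30), wd1=(45, 60), st2=(70, 90), wd2=(44, 62))

print(shake, no_shake)
```

The same pattern extends to other high-level activities: detect the sub-actions first, then test a fixed set of relations between their intervals.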

Page 19: M. S.  Ryoo and J. K.  Aggarwal ICCV2009

Experiments

• The dataset is the UT-Interaction dataset.

• The actions are performed by actors; each video contains shake hands, point, hug, push, kick and punch.

Page 20: M. S.  Ryoo and J. K.  Aggarwal ICCV2009

Experiments

Page 21: M. S.  Ryoo and J. K.  Aggarwal ICCV2009

Experiments

Page 22: M. S.  Ryoo and J. K.  Aggarwal ICCV2009

Conclusion

• The method relies on the extracted features and the spatio-temporal relationships among them.

• It can hierarchically detect high-level actions.

• It misdetects on unusual feature combinations.