Representation for Action Recognition Using Trajectory-Based...

Research ArticleRepresentation for Action Recognition Using Trajectory-BasedLow-Level Local Feature and Mid-Level Motion Feature

Xiaoqiang Li DanWang and Yin Zhang

School of Computer Engineering and Sciences Shanghai University Shanghai 200444 China

Correspondence should be addressed to Xiaoqiang Li xqliishueducn

Received 4 August 2016 Revised 22 March 2017 Accepted 17 September 2017 Published 19 October 2017

Academic Editor Francesco Carlo Morabito

Copyright copy 2017 Xiaoqiang Li et al This is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

The dense trajectories and low-level local features are widely used in action recognition recently However most of these methodsignore the motion part of action which is the key factor to distinguish the different human action This paper proposes a newtwo-layer model of representation for action recognition by describing the video with low-level features and mid-level motionpart model Firstly we encode the compensated flow (119908-flow) trajectory-based local features with Fisher Vector (FV) to retain thelow-level characteristic of motion Then the motion parts are extracted by clustering the similar trajectories with spatiotemporaldistance between trajectories Finally the representation for action video is the concatenation of low-level descriptors encodingvector and motion part encoding vector It is used as input to the LibSVM for action recognition The experiment resultsdemonstrate the improvements on J-HMDB and YouTube datasets which obtain 674 and 876 respectively

1 Introduction

Human action recognition has become a hot topic in thefield of computer vision It has developed a practical sys-tem which will be applied to video surveillance interactivegaming and video annotation Despite remarkable researchefforts and many encouraging advances in recent years [1ndash3] action recognition is still far from being satisfactory andpractical There are large factors affecting accurate rate of therecognition such as cluttered background illumination andocclusion

Most action recognition focuses on two important issuesextracting features within a spatiotemporal volume andmodeling the action patterns Many existing researches onhuman action recognition tend to extract features fromwhole3D videos using spatiotemporal interest points (STIP) [4] Inrecent years optical flow is applied to extract the trajectory-based motion features which have been widely used inlocal spatiotemporal features Local trajectory-based featuresare pooled and normalized to a vector as the video globalrepresentation in action recognition Meanwhile a lot ofwork has focused on developing discriminative dictionaryfor image object recognition or video action recognitionTheBag of Feature (BOF)model generates simple videomodel by

clustering spatiotemporal features of all the training samplesand is trained using 1205942-119896119890119903119899119890119897 Support Vector Machine(SVM) And the state of the art method is popular FisherVector (FV) [5] encoding model based on spatiotemporallocal features However all these methods are not perfectbecause they are only concerned about the low-level spa-tiotemporal features based on interest point and ignored thehigher level features of motion part For most actions onlya small subset of local motion features of the entire video isrelevant to the action label When a person is waving onlythe movement around the arm or hand is responsible for theaction clapping hand Action Bank [6] and motionlets [7]adopt unsupervised learning to discover action parts Manymethods [8] cluster the trajectories and seek to understandspatiotemporal properties of movement to construct themid-level action video representation The Vector of LocallyAggregated Descriptors (VLAD) [9] is a descriptor encodingtechnique that aggregates the descriptors based on a localitycriterion in the feature space To keep more spatiotemporalcharacteristics of the processedmotion part VLAD encodinggets better results than BOF by [10] Inspired by low-levellocal feature encoding and mid-level motion part model arekey factors to distinguish the different human actions wepropose a new representation (depicted in Figure 2) for action

HindawiApplied Computational Intelligence and So ComputingVolume 2017 Article ID 4019213 7 pageshttpsdoiorg10115520174019213

2 Applied Computational Intelligence and Soft Computing

(a) iDT vectors (b) 119908-flow DT vectors

Figure 1 A comparison of the iDT trajectories and 119908-flow dense trajectories The red dot is the end point of the green optical flow vector inthe current frame (a) The optical flow vectors are tracked by the improve dense trajectories using SURF descriptors matching [12] (b) Mostof flow vectors due to camera motion are removed by 119908-flow method

recognition based on local features and motion part in thispaper To reduce the background clutter noise we extract thelocal trajectory-based features through a better compensatedflow (119908-flow) [11] dense trajectories methodThen we clusterthe trajectories through the graph clustering algorithm andencode the group features to describe the different motionpart Finally we represent the video through combiningthe low-level trajectory-based features encoding model withmid-level motion part model

This paper is organized as follows In Section 2 the localdescriptors based on the 119908-flow dense trajectories and low-level video encoding with FV are introduced Then we showclustering the motion part and introduce the representationfor video in Section 3 We describe the evaluation of ourapproach and discuss the results in Section 4 Finally theconclusion and future works are discussed in Section 5

2 First Layer with FV

Trajectories are efficient in capturing object motions invideos We extract spatiotemporal features along the 119908-flowdense trajectories to express low-level descriptors In thissection we introduce the 119908-flow dense trajectories and low-level descriptors with FV

21 119908-Flow Dense Trajectory The idea of dense trajectory isbased on tracking the interest points The interest points aresampled on a grid spaced by 119882 = 5 pixels and tracked ineach frame Points of subsequent frames are concatenated toform a trajectory (119901119905 119901119905+119871)119901119905 = (119909119905 119910119905) is the position ofinterest points at frame 119905 The length of a trajectory is 119871 = 15frames [1] A recent work by Jain et al [11] proposed thecompensated flow (119908-flow) dense trajectories which reducethe impact of the background trajectories The 119908-flow densetrajectory is obtained by removing the affine flow vectorfrom the original optical flow vector The interest point ofthis method is tracked by 119908-flow [11] for compensatingdominant motion (camera motion) It is beneficial for mostof the existing descriptors used for action recognition Thismethod uses the 2D polynomial affine motion model forcompensating camera motion The affine flow 119908aff (119901119905) is themain movement of the two consecutive images which is

usually caused by the movement of the camera We computethe affine flow with the publicly available Motion2D software(httpwwwirisafrvistaMotion2D) which implements areal-time robust multiresolution incremental estimationframework The final flow vector 119908(119901119905) at point 119901119905 = (119909119905 119910119905)is obtained by removing the affine flow vector 119908aff (119901119905) =(119906(119901119905) V(119901119905)) from the original optical flow vector as follows

119908 (119901119905) = 119908 (119901119905) minus 119908aff (119901119905) (1)

Figure 1 shows the dense trajectories extracted by the iDT[12] method and the 119908-flow dense trajectories

The shape of a trajectory encodes local motion patternsThe shape of a trajectory is described by concatenating a setof displacement vectors Δ119875119905 = (119901119905+1 minus 119901119905) = (119909119905+1 minus 119909119905 119910119905+1 minus119910119905) Meanwhile to leverage the motion information in densetrajectories we compute descriptors within a spatiotemporalvolume around the trajectoryThe size of a volume is 32 times 32And the volume is divided into a 2 times 2 times 3 spatiotemporalgrid The Histograms of Optical Flow (HOF and 119908-HOF)[1] descriptor captures the local motion information whichis computed using the orientations and magnitudes of theflow field Motion boundary histogram (MBH) [1] descriptorencodes the relative motion between pixels which is alongboth 119909 and 119910 image axis and describes the discriminatoryfeatures for the action recognition in background clutteringThe trajectory-based 119908-HOF features is computed on thecompensated flow For each trajectory the descriptors com-binemotion informationHOF119908-HOF andMBHThe singletrajectory feature is in the form of

119865 = (119908-HOFHOFMBH119909MBH119910 119878) (2)

The trajectory shape 119878 = (Δ119875119905 Δ119875119905+119871minus1)sum119905+119871minus1119895=119905 Δ119875119895 isnormalized by the sumof themagnitudes of the displacementvectors and 119871 is the length of trajectories

22 Low-Level Video Encoding The representation of video isa vital problem in action recognitionWe first encode the low-level 119908-flow trajectory-based descriptors using the FisherVector (FV) encodingmethod which was proposed for image

Applied Computational Intelligence and Soft Computing 3

Trajectories groups

Linear kernelSVM classifier

Findcorresponding

1

2

3

N

VLADencoding

Second layer

Local descriptorsGMM codebook FV encoding

Feature encoding

w-flow densetrajectories

First layer

Inputvideo

Representation forvideo

Trajectoriesgroups encoding

Descriptorsextraction

S

HOFw-HOFMBH

minus4

z

xy

10

8

6

4

44

2

0

002

minus4minus2

minus4

z

xy

10

8

6

4

44

2

0

002

minus4minus2

minus4

z

xy

10

8

6

4

44

2

0

002

minus4 minus2

minus4

z

xy

10

8

6

4

44

2

0

002

minus4 minus2

Figure 2The recognition framework of the two-layer model The first layer encodes the low-level 119908-flow trajectory-based descriptors usingthe Fisher Vector (FV) The second layer describes motion part with trajectories groups

categorization [13] FV is derived from Fisher kernel whichencodes the statistics between video descriptors andGaussianMixtureModel (GMM)We reduce the low-level features (119908-HOF HOF and MBH) dimensionality by PCA keeping the90 energy The local descriptors 119883 can be modeled by aprobability density function119901(119883 120579)with parameters 120579 whichis usually modeled by GMM

119866119883120579 = 1119873nabla120579 log119901 (119883 120579) 120579 = 1199081 1205831 1205751 119908119896 120583119896 120575119896

(3)

where119908 120583 120575 are the model parameters denoting the weightsmeans and diagonal covariances of GMM 119873 is the numberof local descriptors 119896 is the number of mixture componentsand we set 119896 to 256 [5] We can compute the gradient of thelog likelihood with respect to the parameters of the modelto represent a video FV requires a GMM of the encodedfeature distribution The Fisher Vector is the concatenationof these partial derivatives and describes in which directionthe parameters of themodel should bemodified to best fit thedata [14] To keep the low-level feature we encode each videowith the FV encoding feature

3 Representation for VideoMotion part encoding has already been identified as a suc-cessful method to represent the video for action recognitionIn this section we use a graph clustering method to clusterthe similarity trajectories into groups Then representationfor action video is concatenation of low-level local descriptorsencoding and high-level motion part encoding

31 Trajectories Group To better describe the motion wecluster the similarity trajectories into groups because critical

regions of the video are relevant to a specific action In themethod of [22] they compute a hierarchical clustering ontrajectories to yield trajectories group of action parts Thenwe apply that efficient greedy agglomerative hierarchicalclustering procedure to group the trajectories There are alarge number of trajectories in a video that is there are a largenumber of nodes in graph By removing trajectories distancewhich is not spatially close will get a sparse trajectories graphGreedy agglomerative hierarchical clustering is a fast scalablealgorithm with almost linear complexity in the number ofnodes for relatively sparse trajectoriesrsquo graph To group thetrajectories we set 119873 times 119873 trajectories distance matrix for avideo containing 119873 trajectories We use a distance metricbetween trajectories taking into consideration their spatialand temporal relations to cluster Given two trajectories119875119886(119905)119879119886119905=119905119886 and 119875119887(119905)119879119887119905=119905119887 119889 (119886 119887) = max 119889119904 (119905) sdot 11198792 minus 1198791

1198792sum119905=1198791

119889vel (119905) 119905 isin [1198791 1198792]

119889119904 (119905) = 1003816100381610038161003816119875119886 (119905) minus 119875119887 (119905)10038161003816100381610038162 119889vel = 1003816100381610038161003816Δ119875119886 (119905) minus Δ119875119887 (119905)10038161003816100381610038162 (4)

where 119889119904 and 119889vel are the 1198712 distances of the trajectorypoints at corresponding time instances We just calculate thedistance between the trajectories 119875119886 and 119875119887 simultaneouslyexisting in [1198791 1198792] To ensure the spatial compactness of theestimated groups we enforce the above affinity to be zerofor trajectory pairs that are not spatially close 119889119904 ge 30 Thenumber of clusters in a video is set as the number used in[22] and the number of trajectories in a cluster is below the100 based on empirical value


32 Second Layer with VLAD The trajectory group describ-ing the motion part in the same action categories will havesimilarities To capture the coarser spatiotemporal character-istics of the descriptors in the group 119896 we compute the meanof group descriptors (119908-HOFHOF andMBH) and trajectoryshapes Then we concatenate all the group descriptors (119908-HOF HOF and MBH) as 119866119903 and group shape as the groupdescriptors 119866119892 So the group is described as 119866 = 119866119903 119866119892VLAD [9] is a descriptor encoding technique that aggregatesthe descriptors based on a locality criterion in the featurespace As we know the classic BOF uses the clustering centersstatistics to represent the sample which will result in the lossof the lots of information In group encoding we denote thecode words in the group codebook as 1198881 1198882 119888119896 The groupdescriptors 119866119896119894 = 119866119903 119866119892 are all the group descriptors thatbelong to the 119896th wordThe video will be encoded as a vector

V = ( 1198991sum119894=1

(1198661119894 minus 1198881) 119899119896sum119894=1

(119866119896119894 minus 119888119896)) (5)

where 119896 is the size of codebook learned by the 119896-means clus-tering So the VLAD keeps more information than the BOF

33 Video Encoding We encode each video from the groupdescriptors ofmotion part usingVLADmodelThe codebookfor each kind of group descriptors (119908-HOF HOF MBHand 119878) was separately constructed by using 119870-means clusterAccording to the average number of groups in every videowe set the number of visual words to 50 In order to findthe nearest center quickly we construct a KD-tree when eachgroup descriptors are mapped to the codebook We describevideo encoding vector with the group model for differentdescriptors Then motion part model is encoded by theconcatenation of different descriptors of the group VLADmodel Finally the representation for action recognition isencoded by the concatenation of low-level local descriptorsencoding and mid-level motion part encoding Figure 2shows an overview of our pipeline for action recognition

4 Experiments

In this section we implement some experiments to evaluatethe performance of representation for action We validateour model on several action recognition benchmarks andcompare our results with different methods

41 Datasets We validate our model on three standarddatasets for human action KTH J-HMDB and YouTubedataset The KTH dataset views actions in front of a uniformbackground whereas the J-HMDB dataset [10] and YouTubedataset [16] are collected from a variety of sources rangingfrom digitizedmovies to YouTubeThey cover different scalesand difficulty levels for action recognition We summarizethem and the experimental protocols as follows

The KTH dataset [23] contains 6 action categorieswalking handclapping hand waving jogging running andwalking The background is homogeneous and static in mostsequences We follow the experimental setting [23] dividingthe dataset into the train set and test setWe train amulticlass

4719522 539

584 573652

612674

0

10

20

30

40

50

60

70

80

Accu

racy

()

Accuracy ()

Tra

HO

F

w-H

OF

MBH

HO

F + w

-HO

F

HO

F +

MBH

MBH

+ w

-HO

F

MBH

+ w

-HO

F +

HO

F

Figure 3 Illustration on the effect of our descriptors with FVencoding Each bar corresponds to one of the feature descriptors orfeature combinations

classifier and report average accuracy over all classes asperformance measure

The J-HMDB [10] contains 21 action categories brushhair catch clap climb stairs golf jump kick ball pick pourpull-up push run shoot ball shoot bow shoot gun sit standswing baseball throw walk and wave J-HMDB is a subset ofthe HMDB51 which is collected from the movies or InternetThis dataset excludes categories from HMDB51 that containfacial expressions like smiling and interactions with otherssuch as shaking hands and focuses on single body action Weevaluate the J-HMDB which contains 11 categories involvingone single body action For multiclass classification we usethe one-vs-rest approach

The YouTube Action dataset [16] contains 11 actioncategories basketball biking diving golf swinging horseriding soccer juggling swinging tennis swinging trampo-line jumping volleyball spiking and walking with a dogBecause of the large variations in cameramotion appearanceand pose it is a challenging dataset Following [16] we useleave-one-group-out cross-validation and report the averageaccuracy over all classes

42 Experiment Result The proposed method extract one-scale 119908-flow trajectory-based local features through trackingdense sampling interest points and then cluster the trajecto-ries into groups to encode motion part

In order to choose a discriminative combination offeatures to represent the low-level local descriptors weevaluate the low-level local descriptors based on119908-flowdensetrajectories with Fisher Vector encoding in the first baselineexperiment GMM with 256 components is learned from asubset of 256000 randomly selected trajectory-based localdescriptors Linear SVM with 119888 = 100 is used as classifierWe compare different feature descriptors in Figure 3 where


500

550

600

650

700

750

800

HO

F

w-H

OF

MBH

HO

F + w

-HO

F +

MBH

Low-level encodingOur method

(a) J-HMDB dataset

68

73

78

83

88

HO

F

w-H

OF

MBH

HO

F + w

-HO

F +

MBH


(b) YouTube dataset

Figure 4 The accuracy of different features comparisons between low-level and two-layer model (a) The comparison on J-HMDB dataset(b) The comparison on YouTube dataset

Table 1 The accuracy comparisons of representation for actionrecognition on J-HMDB dataset and YouTube dataset

Datasets Features Low-levelencoding

Two-layermodel

JHMDBHOF 522 561

119908-HOF 539 573MBH 584 629

HOF + 119908-HOF + MBH 629 674

YouTubeHOF 776 779

119908-HOF 752 770MBH 842 850

HOF + 119908-HOF + MBH 853 876

the average accuracy on J-HMDB dataset is reported It canbe seen that MBH descriptors encoding the relative motionbetween pixels work better than other descriptors Figure 3also shows that the combination of HOF 119908-HOF and MBHdescriptors achieves 674 which is the highest precisionamong all kinds of the low-level local descriptors So we usethis combination in the second experiment

In the second baseline experiment the proposed two-layer model of the representation for action is the con-catenation of low-level local descriptors and motion partdescriptors encoding Table 1 and Figure 4 compare the two-layer method with the low-level method for J-HMDB andYouTube datasets It can be seen that the two-layer model

had better performance than the low-level encoding usingdifferent descriptors In addition we compare the proposedmethod with a few classic methods on KTH J-HMDBand YouTube datasets such as DT + BoVW [1] mid-levelparts [21] traditional FV [17] stacked FV [17] DT + BOW[10] and IDT + FV [17] As shown in Table 2 the two-layer model obtains 674 and 876 accuracy on J-HMDBand YouTube datasets respectively And the recognitionaccuracy is improved by 46 on J-HMDB dataset and 22on YouTube dataset compared with other state of the artmethods However the performance on KTH dataset of theproposed method is not the same better as on the J-HMDBand YouTube datasets because the KTH dataset is collectedby the fixed camera with homogeneous background and theadvantage of the119908-flow trajectories is not shown in this case

5 Conclusions

This paper proposed a two-layer model of representation foraction recognition based on local descriptors andmotion partdescriptors which achieved an improvement compared to thelow-level local descriptors Not only did it consider makinguse of low-level local information to encoding the video butalso it combined the motion part to represent the video Italso presented a discriminative and compact representationfor action recognition However there is still room forimprovement First the proposed method cannot determinethe number of groups in different datasets while the numberof groups affects the performance of mid-level encoding a


Table 2 Accuracy comparisons of different methods on KTH YouTube and J-HMDB datasets

KTH YouTube J-HMDBISA [15] 865 Liu et al [16] 712 Traditional FV [17] 6283Yeffet and Wolf [18] 901 Ikizler-Cinbis and Sclaroff [19] 7521 Stacked FV [17] 5927Cheng et al [20] 897 DT + BoVW [1] 854 DT + BOW [10] 566Le et al [15] 939 Mid-level parts [21] 845 IDT + FV [17] 628Two-layer model 926 Two-layer model 876 Two-layer model 674

lot Second many groups in video do not represent theaction part it is needed to develop a method to learn thediscriminately groups for better representation of the videoIn the future we will do research on new group clusteringmethod which can find the more discriminative groups ofaction part

Conflicts of Interest

The authors declare that there are no conflicts of interestregarding the publication of this paper

References

[1] H Wang A Klaser C Schmid and C-L Liu ldquoDense trajecto-ries and motion boundary descriptors for action recognitionrdquoInternational Journal of Computer Vision vol 103 no 1 pp 60ndash79 2013

[2] Y Wang B Wang Y Yu Q Dai and Z Tu ldquoAction-GonsAction recognition with a discriminative dictionary of struc-tured elements with varying granularityrdquo Lecture Notes inComputer Science (including subseries Lecture Notes in ArtificialIntelligence and Lecture Notes in Bioinformatics) Preface vol9007 pp 259ndash274 2015

[3] I LaptevMMarszałek C Schmid and B Rozenfeld ldquoLearningrealistic human actions frommoviesrdquo in Proceedings of the 26thIEEE Conference on Computer Vision and Pattern Recognition(CVPR rsquo08) June 2008

[4] I Laptev ldquoOn space-time interest pointsrdquo International Journalof Computer Vision vol 64 no 2-3 pp 107ndash123 2005

[5] J Wu Y Zhang andW Lin ldquoTowards good practices for actionvideo encodingrdquo in Proceedings of the 27th IEEE Conference onComputer Vision and Pattern Recognition CVPR 2014 pp 2577ndash2584 Columbus OH USA June 2014

[6] S Sadanand and J J Corso ldquoAction bank a high-level represen-tation of activity in videordquo in Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition (CVPR rsquo12) pp1234ndash1241 June 2012

[7] L M Wang Y Qiao and X Tang ldquoMotionlets Mid-level 3Dparts for human motion recognitionrdquo in Proceedings of the 26thIEEE Conference on Computer Vision and Pattern Recognition(CVPR rsquo13) pp 2674ndash2681 June 2013

[8] W Chen and J J Corso ldquoAction detection by implicit inten-tional motion clusteringrdquo in Proceedings of the 15th IEEEInternational Conference on Computer Vision ICCV 2015 pp3298ndash3306 chl December 2015

[9] H Jegou F Perronnin M Douze J Sanchez P Perez andC Schmid ldquoAggregating local image descriptors into compactcodesrdquo IEEE Transactions on Pattern Analysis and MachineIntelligence vol 34 no 9 pp 1704ndash1716 2012

[10] H Jhuang J Gall S Zuffi C Schmid andM J Black ldquoTowardsunderstanding action recognitionrdquo in Proceedings of the 201314th IEEE International Conference on Computer Vision ICCV2013 pp 3192ndash3199 aus December 2013

[11] M Jain H Jegou and P Bouthemy ldquoBetter exploiting motionfor better action recognitionrdquo in Proceedings of the 26th IEEEConference on Computer Vision and Pattern Recognition CVPR2013 pp 2555ndash2562 Portland OR USA June 2013

[12] H Wang and C Schmid ldquoAction recognition with improvedtrajectoriesrdquo in Proceedings of the 14th IEEE International Con-ference on Computer Vision (ICCV rsquo13) pp 3551ndash3558 SydneyAustralia December 2013

[13] F Perronnin J Sanchez and T Mensink ldquoImproving the fisherkernel for large-scale image classificationrdquo in Proceedings of the11th European Conference on Computer Vision (ECCV rsquo10) vol6314 of Lecture Notes in Computer Science pp 143ndash156 CreteGreece 2010

[14] G Csurka and F Perronnin ldquoFisher vectors Beyond bag-of-visual-words image representationsrdquo Communications inComputer and Information Science vol 229 pp 28ndash42 2011

[15] Q V Le W Y Zou S Y Yeung and A Y Ng ldquoLearning hierar-chical invariant spatio-temporal features for action recognitionwith independent subspace analysisrdquo in Proceedings of the IEEEConference on Computer Vision and Pattern Recognition (CVPRrsquo11) pp 3361ndash3368 June 2011

[16] J Liu J Luo and M Shah ldquoRecognizing realistic actions fromvideos in the wildrdquo in Proceedings of the IEEE Computer SocietyConference on Computer Vision and Pattern Recognition (CVPRrsquo09) pp 1996ndash2003 IEEE Miami Fla USA June 2009

[17] X Peng C Zou Y Qiao and Q Peng ldquoAction recognitionwith stacked fisher vectorsrdquo in Computer VisionmdashECCV 201413th European Conference Zurich Switzerland September 6ndash122014 Proceedings Part V vol 8693 of Lecture Notes in ComputerScience pp 581ndash595 Springer Berlin Germany 2014

[18] L Yeffet and L Wolf ldquoLocal trinary patterns for human actionrecognitionrdquo in Proceedings of the 12th International Conferenceon Computer Vision (ICCV rsquo09) pp 492ndash497 Kyoto JapanOctober 2009

[19] N Ikizler-Cinbis and S Sclaroff ldquoObject scene and actionsCombining multiple features for human action recognitionrdquoLecture Notes in Computer Science (including subseries LectureNotes in Artificial Intelligence and Lecture Notes in Bioinformat-ics) Preface vol 6311 no 1 pp 494ndash507 2010

[20] G Cheng Y Wan W Santiteerakul S Tang and B P BucklesldquoAction recognitionwith temporal relationshipsrdquo in Proceedingsof the 2013 IEEE Conference on Computer Vision and PatternRecognition Workshops CVPRW 2013 pp 671ndash675 PortlandOR USA June 2013

[21] M Sapienza F Cuzzolin and P H S Torr ldquoLearning discrim-inative space-time action parts from weakly labelled videosrdquoInternational Journal of Computer Vision vol 110 no 1 pp 30ndash47 2014


[22] M Raptis I Kokkinos and S Soatto ldquoDiscovering discrim-inative action parts from mid-level video representationsrdquo inProceedings of the IEEE Conference on Computer Vision andPattern Recognition (CVPR rsquo12) June 2012

[23] C Schuldt I Laptev and B Caputo ldquoRecognizing humanactions a local SVM approachrdquo in Proceedings of the 17thInternational Conference on Pattern Recognition (ICPR rsquo04) pp32ndash36 August 2004

Submit your manuscripts athttpswwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014


Distributed Sensor Networks


Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014


ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014


Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014


Electrical and Computer Engineering

Journal of

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia


Biomedical Imaging


Advances in


RoboticsJournal of



Computational Intelligence and Neuroscience

Industrial EngineeringJournal of


Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014


Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in



(a) iDT vectors (b) 119908-flow DT vectors

Figure 1 A comparison of the iDT trajectories and 119908-flow dense trajectories The red dot is the end point of the green optical flow vector inthe current frame (a) The optical flow vectors are tracked by the improve dense trajectories using SURF descriptors matching [12] (b) Mostof flow vectors due to camera motion are removed by 119908-flow method

recognition based on local features and motion part in thispaper To reduce the background clutter noise we extract thelocal trajectory-based features through a better compensatedflow (119908-flow) [11] dense trajectories methodThen we clusterthe trajectories through the graph clustering algorithm andencode the group features to describe the different motionpart Finally we represent the video through combiningthe low-level trajectory-based features encoding model withmid-level motion part model

This paper is organized as follows In Section 2 the localdescriptors based on the 119908-flow dense trajectories and low-level video encoding with FV are introduced Then we showclustering the motion part and introduce the representationfor video in Section 3 We describe the evaluation of ourapproach and discuss the results in Section 4 Finally theconclusion and future works are discussed in Section 5

2 First Layer with FV

Trajectories are efficient in capturing object motions invideos We extract spatiotemporal features along the 119908-flowdense trajectories to express low-level descriptors In thissection we introduce the 119908-flow dense trajectories and low-level descriptors with FV

21 119908-Flow Dense Trajectory The idea of dense trajectory isbased on tracking the interest points The interest points aresampled on a grid spaced by 119882 = 5 pixels and tracked ineach frame Points of subsequent frames are concatenated toform a trajectory (119901119905 119901119905+119871)119901119905 = (119909119905 119910119905) is the position ofinterest points at frame 119905 The length of a trajectory is 119871 = 15frames [1] A recent work by Jain et al [11] proposed thecompensated flow (119908-flow) dense trajectories which reducethe impact of the background trajectories The 119908-flow densetrajectory is obtained by removing the affine flow vectorfrom the original optical flow vector The interest point ofthis method is tracked by 119908-flow [11] for compensatingdominant motion (camera motion) It is beneficial for mostof the existing descriptors used for action recognition Thismethod uses the 2D polynomial affine motion model forcompensating camera motion The affine flow 119908aff (119901119905) is themain movement of the two consecutive images which is

usually caused by the movement of the camera We computethe affine flow with the publicly available Motion2D software(httpwwwirisafrvistaMotion2D) which implements areal-time robust multiresolution incremental estimationframework The final flow vector 119908(119901119905) at point 119901119905 = (119909119905 119910119905)is obtained by removing the affine flow vector 119908aff (119901119905) =(119906(119901119905) V(119901119905)) from the original optical flow vector as follows

119908 (119901119905) = 119908 (119901119905) minus 119908aff (119901119905) (1)

Figure 1 shows the dense trajectories extracted by the iDT[12] method and the 119908-flow dense trajectories

The shape of a trajectory encodes local motion patternsThe shape of a trajectory is described by concatenating a setof displacement vectors Δ119875119905 = (119901119905+1 minus 119901119905) = (119909119905+1 minus 119909119905 119910119905+1 minus119910119905) Meanwhile to leverage the motion information in densetrajectories we compute descriptors within a spatiotemporalvolume around the trajectoryThe size of a volume is 32 times 32And the volume is divided into a 2 times 2 times 3 spatiotemporalgrid The Histograms of Optical Flow (HOF and 119908-HOF)[1] descriptor captures the local motion information whichis computed using the orientations and magnitudes of theflow field Motion boundary histogram (MBH) [1] descriptorencodes the relative motion between pixels which is alongboth 119909 and 119910 image axis and describes the discriminatoryfeatures for the action recognition in background clutteringThe trajectory-based 119908-HOF features is computed on thecompensated flow For each trajectory the descriptors com-binemotion informationHOF119908-HOF andMBHThe singletrajectory feature is in the form of

119865 = (119908-HOFHOFMBH119909MBH119910 119878) (2)

The trajectory shape 119878 = (Δ119875119905 Δ119875119905+119871minus1)sum119905+119871minus1119895=119905 Δ119875119895 isnormalized by the sumof themagnitudes of the displacementvectors and 119871 is the length of trajectories

22 Low-Level Video Encoding The representation of video isa vital problem in action recognitionWe first encode the low-level 119908-flow trajectory-based descriptors using the FisherVector (FV) encodingmethod which was proposed for image


Trajectories groups


Findcorresponding

1

2

3

N

VLADencoding

Second layer


Feature encoding


First layer

Inputvideo




S

HOFw-HOFMBH

minus4

z

xy

10

8

6

4

44

2

0

002

minus4minus2

minus4

z

xy

10

8

6

4

44

2

0

002

minus4minus2

minus4

z

xy

10

8

6

4

44

2

0

002

minus4 minus2

minus4

z

xy

10

8

6

4

44

2

0

002

minus4 minus2



119866119883120579 = 1119873nabla120579 log119901 (119883 120579) 120579 = 1199081 1205831 1205751 119908119896 120583119896 120575119896

(3)





1198792sum119905=1198791

119889vel (119905) 119905 isin [1198791 1198792]





V = ( 1198991sum119894=1

(1198661119894 minus 1198881) 119899119896sum119894=1

(119866119896119894 minus 119888119896)) (5)



4 Experiments




4719522 539

584 573652

612674

0

10

20

30

40

50

60

70

80

Accu

racy

()

Accuracy ()

Tra

HO

F

w-H

OF

MBH

HO

F + w

-HO

F

HO

F +

MBH

MBH

+ w

-HO

F

MBH

+ w

-HO

F +

HO

F








500

550

600

650

700

750

800

HO

F

w-H

OF

MBH

HO

F + w

-HO

F +

MBH


(a) J-HMDB dataset

68

73

78

83

88

HO

F

w-H

OF

MBH

HO

F + w

-HO

F +

MBH


(b) YouTube dataset




Two-layermodel

JHMDBHOF 522 561

119908-HOF 539 573MBH 584 629

HOF + 119908-HOF + MBH 629 674

YouTubeHOF 776 779

119908-HOF 752 770MBH 842 850

HOF + 119908-HOF + MBH 853 876




5 Conclusions








References
































Advances in

FuzzySystems


Volume 2014












Journal of



Advances in

Multimedia


Biomedical Imaging


Advances in


RoboticsJournal of










Advances in




Trajectories groups


Findcorresponding

1

2

3

N

VLADencoding

Second layer


Feature encoding


First layer

Inputvideo




S

HOFw-HOFMBH

minus4

z

xy

10

8

6

4

44

2

0

002

minus4minus2

minus4

z

xy

10

8

6

4

44

2

0

002

minus4minus2

minus4

z

xy

10

8

6

4

44

2

0

002

minus4 minus2

minus4

z

xy

10

8

6

4

44

2

0

002

minus4 minus2



119866119883120579 = 1119873nabla120579 log119901 (119883 120579) 120579 = 1199081 1205831 1205751 119908119896 120583119896 120575119896

(3)





1198792sum119905=1198791

119889vel (119905) 119905 isin [1198791 1198792]





V = ( 1198991sum119894=1

(1198661119894 minus 1198881) 119899119896sum119894=1

(119866119896119894 minus 119888119896)) (5)



4 Experiments




4719522 539

584 573652

612674

0

10

20

30

40

50

60

70

80

Accu

racy

()

Accuracy ()

Tra

HO

F

w-H

OF

MBH

HO

F + w

-HO

F

HO

F +

MBH

MBH

+ w

-HO

F

MBH

+ w

-HO

F +

HO

F








500

550

600

650

700

750

800

HO

F

w-H

OF

MBH

HO

F + w

-HO

F +

MBH


(a) J-HMDB dataset

68

73

78

83

88

HO

F

w-H

OF

MBH

HO

F + w

-HO

F +

MBH


(b) YouTube dataset




Two-layermodel

JHMDBHOF 522 561

119908-HOF 539 573MBH 584 629

HOF + 119908-HOF + MBH 629 674

YouTubeHOF 776 779

119908-HOF 752 770MBH 842 850

HOF + 119908-HOF + MBH 853 876




5 Conclusions








References
































Advances in

FuzzySystems


Volume 2014












Journal of



Advances in

Multimedia


Biomedical Imaging


Advances in


RoboticsJournal of










Advances in





V = ( 1198991sum119894=1

(1198661119894 minus 1198881) 119899119896sum119894=1

(119866119896119894 minus 119888119896)) (5)



4 Experiments




4719522 539

584 573652

612674

0

10

20

30

40

50

60

70

80

Accu

racy

()

Accuracy ()

Tra

HO

F

w-H

OF

MBH

HO

F + w

-HO

F

HO

F +

MBH

MBH

+ w

-HO

F

MBH

+ w

-HO

F +

HO

F








500

550

600

650

700

750

800

HO

F

w-H

OF

MBH

HO

F + w

-HO

F +

MBH


(a) J-HMDB dataset

68

73

78

83

88

HO

F

w-H

OF

MBH

HO

F + w

-HO

F +

MBH


(b) YouTube dataset




Two-layermodel

JHMDBHOF 522 561

119908-HOF 539 573MBH 584 629

HOF + 119908-HOF + MBH 629 674

YouTubeHOF 776 779

119908-HOF 752 770MBH 842 850

HOF + 119908-HOF + MBH 853 876




5 Conclusions








References
































Advances in

FuzzySystems


Volume 2014












Journal of



Advances in

Multimedia


Biomedical Imaging


Advances in


RoboticsJournal of










Advances in




500

550

600

650

700

750

800

HO

F

w-H

OF

MBH

HO

F + w

-HO

F +

MBH


(a) J-HMDB dataset

68

73

78

83

88

HO

F

w-H

OF

MBH

HO

F + w

-HO

F +

MBH


(b) YouTube dataset




Two-layermodel

JHMDBHOF 522 561

119908-HOF 539 573MBH 584 629

HOF + 119908-HOF + MBH 629 674

YouTubeHOF 776 779

119908-HOF 752 770MBH 842 850

HOF + 119908-HOF + MBH 853 876




5 Conclusions








References
































Advances in

FuzzySystems


Volume 2014












Journal of



Advances in

Multimedia


Biomedical Imaging


Advances in


RoboticsJournal of










Advances in









References
































Advances in

FuzzySystems


Volume 2014












Journal of



Advances in

Multimedia


Biomedical Imaging


Advances in


RoboticsJournal of










Advances in













Advances in

FuzzySystems


Volume 2014












Journal of



Advances in

Multimedia


Biomedical Imaging


Advances in


RoboticsJournal of










Advances in



Representation for Action Recognition Using Trajectory-Based...

Documents

Transcript of Representation for Action Recognition Using Trajectory-Based...