IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 21, NO. 4, APRIL 2012 2187
A Discriminative Model of Motion and Cross Ratio for View-Invariant Action Recognition
Kaiqi Huang, Senior Member, IEEE, Yeying Zhang, and Tieniu Tan, Fellow, IEEE
Abstract—Action recognition is very important for many applications such as video surveillance and human–computer interaction, and view-invariant action recognition is a hot and difficult topic in this field. In this paper, a new discriminative model is proposed for video-based view-invariant action recognition. In the discriminative model, motion patterns and view invariants are fused together to achieve a better combination of invariance and distinctiveness. We address a series of issues, including interest point detection in image sequences, motion feature extraction and description, and view-invariant calculation. First, motion detection is used to extract motion information from videos, which is much more efficient than traditional background-modeling and tracking-based methods. Second, for feature representation, we extract a variety of statistical information from motion and a view-invariant feature based on the cross ratio. Last, in the action modeling, we apply a discriminative probabilistic model, the hidden conditional random field, to model motion patterns and view invariants, by which we can fuse the statistics of motion and the projective invariance of the cross ratio in one framework. Experimental results demonstrate that our method improves the ability to distinguish different categories of actions with high robustness to view change in real circumstances.
Index Terms—Action recognition, cross ratios, motion detection, view invariance.
I. INTRODUCTION
HUMAN action recognition in video sequences is one of the important and challenging problems in computer vision, which aims to build the mapping between dynamic image information and semantic understanding. While the analysis of human action tries to discover the underlying patterns of human action in image data, it is also very useful in many real applications such as intelligent video surveillance, content-based image retrieval, event detection, and so on. Human action recognition involves a series of problems such as image data acquisition, robust feature extraction and representation, training classifiers with high discriminative capability, and other problems that may arise when a practical system runs. In the short
Manuscript received March 10, 2011; revised July 18, 2011 and October 14, 2011; accepted November 07, 2011. Date of publication February 03, 2012; date of current version March 21, 2012. This work was supported by the National Natural Science Foundation of China under Grant 61135002 and Grant 61175007. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Nikolaos V. Boulgouris.
The authors are with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIP.2011.2176346
history of research on computer vision, many researchers have worked out elaborate solutions to these problems.
In feature extraction, appearance-based methods play an important role in action recognition under constrained environments. By tracking and then classifying, contours of the human body are extracted from real-time videos and then represented by volumes or energy templates, as in [1]. Wang et al. represent the binary image of the human silhouette using the R transform, which is efficient for its low computational cost and geometric invariance [2]. Their method can handle unstable contour features caused by frame loss in video and disjoint silhouettes with holes. Souvenir and Parrigan extended the R-transform method to the temporal domain and used manifold learning to model the sequential variations of contours [3]. To achieve higher accuracy, skeleton modeling was introduced to describe interactions between joints of the human body in [4] and [5].
Motion features such as optical flow and trajectory provide
much useful information in the analysis of human behavior. Actions can be recognized by estimating movement from position and speed information. These features have been intensively used in this area, such as in [6]–[14]. Since motion features perform much better in unconstrained scenes, in our method, we use the optical flow feature to learn motion patterns for view-invariant action recognition.
Beyond traditional template matching methods, parametric
models such as hidden Markov models (HMMs) are well known in action recognition. Yamato et al. used an HMM to model contour features extracted from video. Their method estimates the transitions between different hidden states by expectation maximization (EM) and then computes the likelihood of the observation [15]. For more complex activities such as interactions between two people, coupled HMMs were introduced to model the dynamic interaction process of motion information in videos, which could provide superior training speed, more reliable model likelihood, and higher robustness to complex conditions [16].
The HMM is a simple and efficient generative model, but its limitations are the conditional independence assumption between observations, strong prior knowledge of the data distribution, and local optimization. To increase the discriminative power of modeling, discriminative models such as support vector machines (SVMs), bag of words, and conditional random fields (CRFs) are used to evaluate the conditional probability of the observation given the learned models [17]–[19]. These models may achieve higher recognition rates when a large training data set is available.
New algorithms and systems are in bloom as intelligent
video surveillance draws much attention for the defense of
1057-7149/$26.00 © 2011 IEEE
public security. However, action recognition becomes quite difficult under unconstrained conditions such as view change, cluttered background, and occlusions. In this paper, we focus on view-invariant action recognition to elaborate a new method more capable of dealing with viewpoint variations in real applications.
Various methods concerning view invariance in action recognition have been proposed. Generally, they can be grouped into four categories. First, 3-D reconstruction techniques can provide the most reliable view-independent representation of actions. In [20] and [21], the images recorded by multiple calibrated cameras are projected back to 3-D visual hulls of the human body in different poses. However, the high computational cost of this method limits its practical applications since it has to calculate the calibration configurations of cameras and the correspondences between images from different views. Second, the epipolar geometric relations in multiple-view geometry lead to constraints between image points in different views. For example, the fundamental ratios proposed in [22], which are ratios of the elements of the fundamental matrix and are proved to be invariant to viewpoint, can be used to represent pose transitions in a model-based method. The problem is that manual labeling of the joints of the human body is required to find the triplets of points necessary for homography calculation. Third, the mapping between motion representations and view-invariant patterns can be automatically learned by machine learning techniques such as those in [23] and [24]. These methods do not need to extract view-invariant features from images. Nevertheless, they implicitly assume that the mapping satisfies underlying predefined models with empirical priors, and it is not clear which aspect of their representation of actions accounts for the observations. Last, people also try to construct invariants directly from images. For example, [25] applies the spatiotemporal curvature of the 2-D trajectory of the hands to capture dramatic changes in motion. The curvature is a view-invariant feature, but its high-order derivatives degrade the signal-to-noise ratio since the curve is not smooth enough.
The method in [26] assumes that there is a moment in an action when some of the joints of the body are coplanar, i.e., a canonical pose, in order to find invariant patterns in actions. However, its application is quite limited since it is hard to detect such a canonical pose in videos automatically and correctly.
We propose a compact framework for view-invariant action recognition from the perspective of motion information in image sequences, in which motion patterns and view invariants are encapsulated in a model that results in a complementary fusion of these two aspects of human actions. In the following sections, we discuss a series of issues relating to interest point detection in image sequences, motion feature extraction and representation, and model selection. Above all, the main contribution of this paper lies in the following three aspects.
• New features are extracted from video motion information, including a view-invariant feature (the cross ratio) and optical-flow-based features.
• A new feature representation based on optical flow is described, using oriented histogram projection in a statistical way to keep the feature representation robust to noise.
Fig. 1. Flowchart of our method.
• A discriminative probabilistic model, i.e., the hidden CRF (hCRF), is used to fuse the proposed statistics of motion and the projective invariance of the cross ratio in one effective and compact framework.
Practically, the proposed method has shown excellent discrimination ability in recognizing different actions while preserving high robustness to view changes in real circumstances.
II. FRAMEWORK OF OUR METHOD
In this paper, we describe our method in three phases. In the first stage, we introduce a motion detection method to detect interest points in the space and time domain, which facilitates not only optical flow extraction in an expected local area but also the computation of view invariants (cross ratios) from the detected interest points. In the second stage, the oriented optical flow feature is represented by oriented histogram projection to capture statistical motion information. In the third stage, the optical flow and view-invariant features are fused together in a discriminative model.
The proposed framework of action recognition has shown good integration and achieved the expected performance on some challenging databases. The flowchart of the proposed method is shown in Fig. 1.
III. MOTION DETECTION
In previous work on view-invariant action recognition, it is difficult to obtain very accurate trajectories of moving objects because of noise caused by self-occlusions [27]. Appearance-based methods such as the scale-invariant feature transform (SIFT) [28] are not quite suitable for motion analysis since appearance cues, such as color, gray level, and texture, are not stable among neighboring frames of dynamic scenes. Compounded by the nonrigidity of the human body, the extraction of stable continuous interest points is far from easy.
For object detection, background modeling and tracking techniques may be the first choice. However, existing methods of background modeling, such as the Gaussian mixture model, suffer from low effectiveness relative to what accurate human behavior analysis requires. For example, some traditional methods do not work well in the presence of shadows, light changes, and, particularly, view changes in real scenes. Consequently, we are inclined
to methods that directly extract informative features from images without background modeling and tracking, which can be achieved by taking advantage of image saliency measurements to detect key points in images.
Usually, object detection starts from corner detection by making use of the gray gradient in the 2-D image plane. In order to detect such a region of interest in images, a measurement of cornerness has to be defined. The most common method is to take the Laplacian of Gaussian (LOG) as a response of image gray gradients; for example, the response function of Harris corner detection is defined as
R = \lambda_1 \lambda_2 - k(\lambda_1 + \lambda_2)^2 = \det(M) - k \, \mathrm{trace}^2(M)   (1)

where \lambda_1 and \lambda_2 are the eigenvalues of the derivative-of-Gaussian (second-moment) matrix M of the local area in image I, and k is an empirically chosen parameter.
Lowe proposed the SIFT feature detection method, whose response function approximates the LOG by a difference of Gaussians across scales, and achieved invariance in scale space [28].
The idea of 2-D interest point detection has been extended to
the image sequences in the 3-D space and time domain. In addition to the gray scale, a measurement of gray-gradient change along time can be defined as another portion of the energy response in the timescale. Dollar et al. proposed a Gabor-filter-based algorithm for space–time interest point (STIP) detection [29]. Unlike the original methods in [30], Dollar's method takes into account the gradient of an image spatially and temporally by using a Laplace operator to detect the instability of image intensity. They apply a Gabor filter to measure the saliency of a specific region of the image sequence in the space and time domain. Since the Gabor filter captures texture information of images fairly well, it is widely used for iris recognition, fingerprint recognition, and other applications of texture analysis. In their method, the response function is defined as follows:
R = (I * g * h_{ev})^2 + (I * g * h_{od})^2   (2)

where the image sequence is denoted as I(x, y, t). In addition, g(x, y; \sigma) represents the 2-D Gaussian smoothing kernel, and h_{ev} and h_{od} are a quadrature pair of temporal Gabor filters, corresponding to the even and odd channels

h_{ev}(t; \tau, \omega) = -\cos(2\pi t \omega) e^{-t^2/\tau^2}, \quad h_{od}(t; \tau, \omega) = -\sin(2\pi t \omega) e^{-t^2/\tau^2}.   (3)
Each component of the response function reflects the degree of deviation of the image intensity in the space and time domain. Experimentally, the values of \sigma and \tau are tuned to customize the window size of the convolution operation spatially and temporally.
By the convolution operation within a time window, the local gradient information takes the form of energy. After thresholding, the points of interest can be located.
Another method, which considers the spatial–temporal volume, achieves better results in feature extraction [43]; it models
Fig. 2. STIP detection.
the motion variations in the local neighborhood. In our method, the cross ratio is important for view invariance, which requires action trajectories with stable key points. Intrinsically, we also make use of the spatial–temporal property. Here, we take advantage of STIP detection [29] to extract motion information from video sequences and, meanwhile, to ease the extraction of view invariants directly from images. As shown in Fig. 2, the red points illustrate the detected STIPs when a person is jumping.
In this way, we extract many key points that are informative for recognition in spite of the occurrence of noise. Since many points are detected, a nonmaxima suppression method is used to select the relatively stable points of interest as a representation of the current frame, which gives much better performance, particularly for periodic motion patterns. The algorithm runs at more than 25 frames/s on a PC with a 2.8-GHz Intel quad-core CPU.
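The temporal part of the detector in (2) and (3) can be sketched in isolation. Below is a minimal 1-D illustration (our own sketch, not the authors' implementation): it applies the quadrature pair of temporal Gabor filters to the intensity of a single pixel over time and sums the squared responses, assuming spatial Gaussian smoothing has already been applied; the coupling ω = 4/τ follows Dollar et al. [29].

```python
import math

def gabor_pair(tau, t_range):
    """Quadrature pair of 1-D temporal Gabor filters as in (3).
    omega = 4 / tau couples the frequency to the temporal scale."""
    omega = 4.0 / tau
    ev = [-math.cos(2 * math.pi * t * omega) * math.exp(-t * t / tau ** 2)
          for t in t_range]
    od = [-math.sin(2 * math.pi * t * omega) * math.exp(-t * t / tau ** 2)
          for t in t_range]
    return ev, od

def response(signal, tau=1.5):
    """Energy response R = (s*h_ev)^2 + (s*h_od)^2 at the centre frame.
    `signal` is the (already spatially smoothed) intensity of one pixel
    over time; the spatial Gaussian is omitted in this 1-D sketch."""
    half = len(signal) // 2
    t_range = range(-half, half + 1)
    ev, od = gabor_pair(tau, t_range)
    r_ev = sum(s * h for s, h in zip(signal, ev))
    r_od = sum(s * h for s, h in zip(signal, od))
    return r_ev ** 2 + r_od ** 2

# A periodic motion pattern responds far more strongly than a static one.
periodic = [math.sin(2 * math.pi * 0.4 * t) for t in range(-7, 8)]
static = [1.0] * 15
print(response(periodic) > response(static))  # True
```

Thresholding this response over all pixels, as described above, yields the candidate points of interest.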
A. Nonmaxima Suppression
Since multiple key points may be found in each frame, we use a nonmaxima suppression method [46] to select a relatively reliable one as the point of interest of the frame.
As shown in Fig. 3, for each detected key point, we define a radius r and then compare its energy response with those in the neighboring circular area of radius r. If no other point in this area has a higher energy response than the central point, the center is considered a candidate point of interest in the image. After traversing all the detected points, we select the point with the highest energy response among all candidates as the final point of interest of the image. In addition, for each action sequence, we define a fixed-size sliding window as the basic unit of a data sample and, correspondingly, pick a fixed number of interest points within the window to describe the action sequence.
Although a single point per frame is selected, the overall stability of these points across the image sequence ensures high robustness of the view-invariant feature extracted from consecutive interest points in the neighborhood, as we will see in the tests in later sections.
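The per-frame selection described above can be sketched as follows (a minimal sketch; the function and variable names are ours, not from the paper): a point is kept only if nothing stronger lies within radius r, and the strongest survivor becomes the frame's point of interest.

```python
def suppress_nonmaxima(points, radius):
    """Greedy nonmaxima suppression over detected key points.

    `points` is a list of (x, y, energy) tuples for one frame. A point
    survives only if no other point within `radius` has a higher energy
    response; of the survivors we keep the single strongest one as the
    frame's point of interest (one point per frame, as in Section III-A).
    """
    def near(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= radius ** 2

    candidates = [p for p in points
                  if all(not (near(p, q) and q[2] > p[2])
                         for q in points if q is not p)]
    return max(candidates, key=lambda p: p[2]) if candidates else None

# The weaker point at (11, 10) lies within the radius of the stronger
# point at (10, 10) and is suppressed; the isolated point survives but
# loses the final max-energy selection.
pts = [(10, 10, 0.9), (11, 10, 0.5), (40, 40, 0.7)]
print(suppress_nonmaxima(pts, radius=5))  # (10, 10, 0.9)
```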
IV. FEATURE EXTRACTION AND REPRESENTATION
In view-invariant action recognition, it is challenging to find an appropriate feature that is robust to view change while
Fig. 3. Nonmaxima suppression.
preserving discrimination in recognition. The most important characteristic of view invariance is stability under different views while remaining distinctive across different classes. As known from information theory, the more uniform (i.e., the higher the entropy of) the distribution of a feature is, the less discriminative information an observation of it carries; empirically, the more randomly things happen, the more uniform the distribution of the data is. Therefore, trying to extract a feature that is absolutely invariant to view change necessarily leads to a decline of its discriminative power. Here, we consider the tradeoff between invariance and discrimination: we put forward a method that combines motion features and view invariants and use a learning method to obtain a tradeoff between these two aspects.
A. Motion Feature
Since appearance-based features such as Harris, the histogram of oriented gradients (HOG), SIFT, Gabor, and shape highly depend on the stability of image processing, they fail to accurately recognize different kinds of actions because of the nonrigid nature of the human body and other impacts in real applications. Therefore, in our method, after the detection of interest points in videos, we extract motion features from the neighboring area around the interest points and build a representation of the statistical properties of the local area of the image.
a) Optical Flow Extraction: Optical flow takes the form of a 2-D vector representing image pixel velocity in the x- and y-directions. The beginning and ending points of the optical flow vector correspond to the displacement of image pixels.
There are mainly two kinds of methods to extract optical flow from images. The first is feature-based methods, which calculate the matching score of feature points between neighboring frames and take the displacements of the matched points as the start and end points of the optical flow vectors. However, due to the instability of image edges and the large displacement of the moving human body, the calculated optical flow can hardly exhibit the real movement of the human body. The second is gradient-based methods, which are widely used in computer vision tasks. Gradient-based methods assume that the gray level in a local area of the image is relatively stable between adjacent frames. More importantly, by calculating the image gradient and optimizing the cost function in an iterative way, they can give a dense field of optical flow.
In this paper, we use the pyramidal Lucas–Kanade (PLK) algorithm [31] to calculate optical flow from the image sequence.
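For intuition, the single-scale Lucas–Kanade step that PLK applies at each pyramid level can be sketched as follows. This toy version (pure Python, no pyramid, hypothetical function name) solves the 2 × 2 normal equations arising from the brightness-constancy constraint I_x u + I_y v + I_t = 0 over a small window:

```python
def lk_flow(prev, curr, x0, y0, win=5):
    """Single-window Lucas-Kanade estimate (no pyramid) of the optical
    flow at (x0, y0). `prev`/`curr` are 2-D lists of gray values. Solves
    the 2x2 normal equations built from spatial gradients (Ix, Iy) and
    the temporal difference It, assuming brightness constancy."""
    sxx = sxy = syy = sxt = syt = 0.0
    h = win // 2
    for y in range(y0 - h, y0 + h + 1):
        for x in range(x0 - h, x0 + h + 1):
            ix = (prev[y][x + 1] - prev[y][x - 1]) / 2.0  # central diffs
            iy = (prev[y + 1][x] - prev[y - 1][x]) / 2.0
            it = curr[y][x] - prev[y][x]
            sxx += ix * ix; sxy += ix * iy; syy += iy * iy
            sxt += ix * it; syt += iy * it
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-9:          # aperture problem: degenerate gradients
        return 0.0, 0.0
    # Cramer's rule on [sxx sxy; sxy syy][u v]^T = -[sxt syt]^T
    u = -(syy * sxt - sxy * syt) / det
    v = -(sxx * syt - sxy * sxt) / det
    return u, v

# A curved intensity pattern shifted one pixel to the right between
# frames should yield a flow of roughly (1, 0).
prev = [[float(x * x + y * y) for x in range(20)] for y in range(20)]
curr = [[float((x - 1) ** 2 + y * y) for x in range(20)] for y in range(20)]
u, v = lk_flow(prev, curr, 10, 10)
print(round(u), round(v))  # 1 0
```

The pyramid in PLK repeats this step from coarse to fine scales so that large displacements, which break the small-motion assumption at full resolution, can still be recovered.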
Fig. 4. Histogram projection.
Unlike the Lucas–Kanade method [32], PLK downsamples images to different scales and then iteratively minimizes the gradient variations of the gray level in the local area between neighboring frames.
Since PLK has to compute the derivatives of the image intensity on different scales iteratively, we only extract optical flow features in a local area around the STIPs instead of over the whole image, by which we avoid high computational cost and meanwhile reduce redundancy in feature extraction.
b) Feature Description of Local Motion: The drawback of
optical flow is that it brings much noise into the experiment, particularly when the object is moving fast with large displacement. Therefore, we would rather exploit the statistical characteristics of optical flow, for example, the main direction or the histogram of optical flow of an image patch, than treat the optical flow vector as an exact pixel displacement. In previous work [33], HOG achieved good results in pedestrian detection mainly because of its effective statistical feature expression strategy. Similarly, we project the magnitudes of optical flows into directional histogram bins to describe the local motion information.
After a key point is detected, we apply the PLK algorithm to calculate optical flow in the local area around the key point. Each optical flow vector within an image cell carries a weight in proportion to its magnitude in the projection into the histogram bins. Histograms from different image cells in a block are concatenated to form a feature vector.
As shown in Fig. 4, we divide the circumference into eight equal bins in our experiment. Each bin collects the voted weights of the magnitudes of the optical flows in the current region of interest. For each optical flow vector, the weight of its magnitude is split between the two nearest bins in proportion to the orientation gaps between the optical flow and the bin boundaries. The image block is set to 4 × 4 cells, and each cell is voted into eight histogram bins, as shown in Fig. 4; thus, the projected histogram forms a (4 × 4 × 8 =) 128-dimensional feature vector.
It is worth noting that the spatiotemporal volume achieved excellent results in representing motion information in [43] because of its property of modeling motion information in the local neighborhood. Intrinsically, we also take this advantage. The difference is that the spatiotemporal volumes are cuboids in videos in their method, whereas they are blocks along consecutive interest points in our method.
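The voting scheme for a single cell can be sketched as follows. This is a simplified sketch under one common reading of magnitude-weighted soft binning (the paper's exact bin geometry may differ): each flow vector votes its magnitude, split between the two nearest orientation-bin centres.

```python
import math

def flow_histogram(flows, nbins=8):
    """Project the optical flow vectors of one cell into `nbins`
    orientation bins. Each vector votes its magnitude, split between
    the two nearest bin centres (at angles i * 2*pi/nbins) in
    proportion to the angular gaps (soft assignment)."""
    hist = [0.0] * nbins
    width = 2 * math.pi / nbins
    for dx, dy in flows:
        mag = math.hypot(dx, dy)
        ang = math.atan2(dy, dx) % (2 * math.pi)
        pos = ang / width            # position among the bin centres
        lo = math.floor(pos)
        frac = pos - lo              # angular gap to the lower centre
        hist[lo % nbins] += mag * (1 - frac)
        hist[(lo + 1) % nbins] += mag * frac
    return hist

# A flow along +x (angle 0) sits exactly on the centre of bin 0, so its
# whole magnitude lands there; the total histogram mass equals the
# summed magnitudes.
h = flow_histogram([(2.0, 0.0)])
print(h[0], sum(h))  # 2.0 2.0
```

Concatenating such per-cell histograms over the cells of a block then yields the 128-dimensional feature vector described above.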
Fig. 5. Sets of four points with identical cross ratios under projective transformations.
B. View Invariants
Geometric invariants capture invariant information of a geometric configuration under a certain class of transformations. Group theory gives a theoretical foundation for constructing invariants [34]. Since they can be measured directly from images without knowing the orientation and position of the camera, they have been widely used in object recognition to tackle the problem of projective distortion caused by viewpoint variations.
In view-invariant action recognition, traditional model-based methods evaluate the fitness between image points and predefined 3-D models. However, it is difficult to detect qualified image points that satisfy the specific geometric configuration required to obtain the desired invariants.
The cross ratio is the most common invariant. As shown in Fig. 5, sets of four collinear points with the same permutation lying on different planes form cross ratios with the same value. To obtain a cross ratio as an invariant, the image points must be collinear in the original 3-D space before projection.
To avoid the constraint of collinearity of image points when constructing invariants from an image, we calculate invariants across neighboring frames rather than from a single image. We generalize the cross ratio of four collinear points to cross ratios across frames (CRAFs) using five neighboring coplanar points sampled from trajectories of actions. The only assumption we make is that the five points detected from neighboring frames are approximately coplanar. Under this assumption, the proposed method needs neither a model of the human body nor manual labeling of image points.
a) Invariants Across Frames: In previous work [27], we
have assumed that trajectories of several key joints of the body, such as the hand, foot, or head, can be obtained by feature-tracking techniques. Once we get trajectories from the image sequences, we can construct a pair of cross ratios for every five points sampled from the trajectories. Pairs of cross ratios are then transformed into histograms as the feature vectors for SVM classification. In this paper, we take the sequential STIPs as the motion trajectories of an action and extract the view-invariant feature (the cross ratio) from the key points detected over multiple frames.
The cross ratio is invariant to projective transformations. It is defined as

Cr(P_1, P_2; P_3, P_4) = \frac{|P_1 P_3| \cdot |P_2 P_4|}{|P_2 P_3| \cdot |P_1 P_4|}.   (4)
Fig. 6. Sets of four points with identical cross ratios under projective transformations.
Here, P_1, P_2, P_3, and P_4 represent a set of four collinear points, and the value of Cr(P_1, P_2; P_3, P_4) is preserved by projective transformations.
This precondition of collinearity limits the application of the cross ratio of four collinear points. Therefore, we make a generalization by constructing a pair of cross ratios in the same way as in [34]. As illustrated in Fig. 6, if we have obtained a trajectory on which there are five approximately coplanar points P_1, \ldots, P_5, we use these five points to generate two groups of four collinear points. With the two groups of collinear points, we compute their cross ratios and denote them as CR_1 and CR_2, respectively. Thus, we get the view-invariant representation of these five points as follows:

(CR_1, CR_2)   (5)

where (CR_1, CR_2) denotes our view-invariant representation of five trajectory points. The only precondition of this generalization is the coplanarity of the five points. Empirical tests in Section VI show that this precondition is satisfied in most real cases.
Computing CR_1 and CR_2 is straightforward as long as the
coordinates of the five points on the image plane are known. Here, we use formulas that have been proven in projective geometry. Treating the four lines from P_1 (respectively, P_2) to the other four points as a pencil whose cross ratio equals that of the four collinear points it cuts, we have

CR_1 = \frac{|P_1P_2 \; P_1P_4| \cdot |P_1P_3 \; P_1P_5|}{|P_1P_2 \; P_1P_5| \cdot |P_1P_3 \; P_1P_4|}   (6)

CR_2 = \frac{|P_2P_1 \; P_2P_4| \cdot |P_2P_3 \; P_2P_5|}{|P_2P_1 \; P_2P_5| \cdot |P_2P_3 \; P_2P_4|}   (7)

where |A \; B| is the determinant of the 2 × 2 matrix [A B] whose columns are the vectors A and B (e.g., P_1P_2 denotes the vector from P_1 to P_2).
Degenerate groups of points might appear while computing
the cross ratios. For example, the line defined by one pair of points may be parallel to the line defined by another pair, or three of the five points may be collinear. In these cases, we either assign a relatively large fixed number to the cross ratio or simply ignore the group. Since most of the sampled points are in general position, the degenerate groups do not affect the outputs of the algorithm.
b) CRAF Histograms: For each group of five points, we get a pair of cross ratios, and over a sequence we obtain a sequence of such pairs. These cross ratios are voted into bins
to form a histogram as the representation of the feature vector for classification. In detail, the value of the i-th histogram bin is defined as

h_i = \sum_{k=1}^{N} \delta_i(CR_k), \quad \text{where } \delta_i(CR_k) = \begin{cases} 1, & \text{if } l_i \le CR_k < u_i \\ 0, & \text{else} \end{cases}   (8)

where N is the count of the cross ratios CR_k, and l_i and u_i correspond to the lower and upper bounds of the i-th bin of the histogram. Practically, the values of the cross ratios vary from 0 to 1, and here, we discretize them into eight bins.
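The invariance that CRAF rests on, namely that the cross ratio of four collinear points in (4) survives any projective transformation, can be checked numerically. The sketch below is illustrative only (the point coordinates and the homography are made up); it parameterizes the points along their common line and compares the cross ratio before and after a random homography:

```python
import random

def cross_ratio(p1, p2, p3, p4):
    """Cross ratio of four collinear 2-D points, via an affine
    parameter along the line: CR = (|P1P3| * |P2P4|) / (|P2P3| * |P1P4|)."""
    dx, dy = p4[0] - p1[0], p4[1] - p1[1]          # line direction
    t = [(p[0] - p1[0]) * dx + (p[1] - p1[1]) * dy  # signed position
         for p in (p1, p2, p3, p4)]
    return ((t[2] - t[0]) * (t[3] - t[1])) / ((t[2] - t[1]) * (t[3] - t[0]))

def project(p, H):
    """Apply a 3x3 homography H to a 2-D point (projective transform)."""
    x = H[0][0] * p[0] + H[0][1] * p[1] + H[0][2]
    y = H[1][0] * p[0] + H[1][1] * p[1] + H[1][2]
    w = H[2][0] * p[0] + H[2][1] * p[1] + H[2][2]
    return (x / w, y / w)

# Four collinear points (on y = x/2) and a random projective transform.
pts = [(0.0, 0.0), (1.0, 0.5), (3.0, 1.5), (7.0, 3.5)]
random.seed(0)
H = [[1 + random.random(), random.random(), random.random()]
     for _ in range(3)]
before = cross_ratio(*pts)
after = cross_ratio(*[project(p, H) for p in pts])
print(abs(before - after) < 1e-9)  # True
```

The full CRAF feature then pairs two such cross ratios per group of five approximately coplanar trajectory points, as in (5), before histogram voting.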
V. ACTION MODELING
Up to now, we have obtained the motion feature description and the view invariants of interest points. The remaining problem is how to model the temporal information of the sequential data. Unlike object classification in static images, action recognition should take into account the temporal dependence and pace of an action. In the literature on time series analysis, the most commonly used model is the HMM, particularly in speech recognition and human action recognition, which models the joint probability of observation and state sequences given the model parameters. Once the emission probability is set, the state transition matrix A and the observation matrix B can be learned by an EM algorithm. In the recognition process, the HMM predicts the category of a sequence by calculating the likelihood of the observation given the model.
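For concreteness, that likelihood evaluation can be sketched with the standard forward algorithm on a toy two-state HMM (illustrative only; the values of pi, A, B, and the observation symbols are made up):

```python
def hmm_likelihood(obs, pi, A, B):
    """Forward algorithm: likelihood P(obs | model) of an observation
    sequence under an HMM with initial distribution pi, state transition
    matrix A, and observation (emission) matrix B. Recognition then
    picks the action class whose model gives the highest likelihood."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

# Toy 2-state, 2-symbol model: state 0 prefers symbol 0, state 1 symbol 1.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
# A sequence matching the model's structure scores higher than a
# mismatched one.
print(hmm_likelihood([0, 0, 1], pi, A, B) >
      hmm_likelihood([1, 1, 0], pi, A, B))  # True
```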
it suffers from several limitations such as conditional indepen-dence assumptions between observations, strong prior knowl-edge of the data, and local optimization. CRF [18] is widelyused in word segmentation, named-entity recognition, text anal-ysis, and so on. In [35], hidden states are introduced to modelthe different part in object recognition. Because of its excellentperformance, it has been also applied to human action recogni-tion [8], [17], [36]–[38].To effectively model more complex human actions, a dis-
criminative model based on maximum entropy–hCRF is usedfor motion features of an action sequence [45]. Another modelnamed latent-dynamic CRF, which can model dynamics be-tween actions and allow automatic action segmentation [44],needs much more computation cost. In this paper, view-in-variant feature could be integrated into the hCRF model tomaintain an overall high recognition rate, which results in goodrobustness of the model to the change in view angles.With the hCRF model, we could make full use of informa-
tion extracted from neighboring frames rather than a singlestatic image and meanwhile bring the view invariants togetherin dealing with view changes in action recognition applica-tions. In [8], hCRF is used to model the dependence betweendifferent image patches. Intuitively, it could be also used tomodel different phases of human actions in image sequences.The graphic structure of hCRF is shown in Fig. 7, in whichdenotes the observations, representing optical flow feature.is the hidden variable in the middle layer of the graph;
Fig. 7. Structure of hidden CRF.
indicates the class label of actions. Overall, the graph outputs a conditional probability, which takes the form of

P(y \mid x; \theta) = \sum_{h} P(y, h \mid x; \theta) = \frac{\sum_{h} e^{\Psi(y, h, x; \theta)}}{\sum_{y', h} e^{\Psi(y', h, x; \theta)}}   (9)

where \theta denotes the unknown parameters of the model; the graph vertices, denoted by V, represent the different variables, and the edges between the nodes of the graph, denoted by E, describe the interactions between the variables. The sets of nodes and edges constitute the graph G = (V, E).
According to Fig. 7, we can formulate the potential function in (9) as follows:

\Psi(y, h, x; \theta) = \sum_{j \in V} \varphi(x_j) \cdot \theta_s(h_j) + \sum_{j \in V} \theta_l(y, h_j) + \sum_{(j,k) \in E} \theta_e(y, h_j, h_k) + \varphi_{cr}(x) \cdot \theta_v(y).   (10)
The four components on the right-hand side of (10) correspond to four different connections in the graphic structure shown in Fig. 7. The specific definition of each part is customized in our experiment as follows.
1) Observations of motion x and hidden states h: This component describes the relation between observations and hidden variables, which measures the matching score between the motion features extracted from images and the hidden states. In our experiment, the observations correspond to the histograms obtained by oriented optical flow projection
\sum_{j \in V} \varphi(x_j) \cdot \theta_s(h_j)   (11)
where \theta_s is the parameter that measures the compatibility between an observation and a hidden state.
2) Hidden states h and labels y:
\sum_{j \in V} \theta_l(y, h_j)   (12)
where \theta_l measures the compatibility between hidden states and labels.
3) Pairs of different hidden states and labels:
\sum_{(j,k) \in E} \theta_e(y, h_j, h_k)   (13)
where \theta_e measures the compatibility among different hidden states and labels linked by the edges of the graph shown in Fig. 7.
4) View invariants and labels y: This part of the potential function takes into account the view-invariant feature (the cross ratio), represented by \varphi_{cr}(x), which can be computed once the locations of the points of interest are known
(14)
where measures the compatibility between view invari-ants and labels.
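A toy version of the full four-part potential in (10) on a chain-structured graph might look as follows. All parameter tables, the chain topology, and the input values are illustrative assumptions, not the learned model:

```python
def potential(y, h, x, cr, theta):
    """Toy version of the four-part potential of Eq. (10) on a chain:
    observation/state terms, state/label terms, edge terms, and a
    cross-ratio/label term. Everything here is illustrative."""
    V = range(len(h))
    E = [(j, j + 1) for j in range(len(h) - 1)]               # chain-structured graph
    s = sum(theta["obs"][h[j]] * x[j] for j in V)             # observations <-> hidden states
    s += sum(theta["lab"][(y, h[j])] for j in V)              # hidden states <-> label
    s += sum(theta["edge"][(y, h[j], h[k])] for j, k in E)    # neighboring states <-> label
    s += theta["cr"][y] * cr                                  # view invariant <-> label
    return s

theta = {
    "obs":  {0: 0.5, 1: -0.5},
    "lab":  {(y, a): 0.3 if y == a else -0.3 for y in (0, 1) for a in (0, 1)},
    "edge": {(y, a, b): 0.2 if a == b else -0.2
             for y in (0, 1) for a in (0, 1) for b in (0, 1)},
    "cr":   {0: 0.1, 1: -0.1},
}
val = potential(y=1, h=(1, 0, 1), x=[0.4, -0.2, 0.9], cr=0.6, theta=theta)
print(round(val, 2))  # -0.91
```

In the real model, the scalar observation and cross-ratio inputs would be the histogram features $\varphi(x_j)$ and $r$, and the tables would be the learned parameters $\theta$.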
By normalization, the summed potentials are converted into the conditional probability that we expect.

In the training process, a gradient-based method is applied to estimate the model parameters $\theta$ iteratively. The derivative of the log conditional probability with respect to the parameters is

$$\frac{\partial \log P(y \mid \mathbf{x}; \theta)}{\partial \theta} = E_{P(\mathbf{h} \mid y, \mathbf{x}; \theta)}\left[\frac{\partial \Psi(y, \mathbf{h}, \mathbf{x}; \theta)}{\partial \theta}\right] - E_{P(y', \mathbf{h} \mid \mathbf{x}; \theta)}\left[\frac{\partial \Psi(y', \mathbf{h}, \mathbf{x}; \theta)}{\partial \theta}\right] \qquad (15)$$

The learning process starts with a given initial value of $\theta$, and a locally optimal solution is obtained after a number of iterations of gradient ascent on the log-likelihood. Specifically, in our method, we assume that the graph obeys a tree structure, which is approximated by using a minimum spanning tree algorithm. In addition, all the expectations in (15) are computed by belief propagation.
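The minimum-spanning-tree approximation of the graph structure can be sketched with Prim's algorithm, assuming Euclidean edge weights between interest points (the point set below is made up for illustration):

```python
import math

def prim_mst(points):
    """Minimum spanning tree over a set of points with Euclidean edge
    weights -- one way to impose the tree structure assumed for the
    hCRF graph (Prim's algorithm, O(n^3) naive version for clarity)."""
    n = len(points)
    dist = lambda a, b: math.dist(points[a], points[b])
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # pick the cheapest edge leaving the current tree
        j, k = min(((a, b) for a in in_tree for b in range(n) if b not in in_tree),
                   key=lambda e: dist(*e))
        edges.append((j, k))
        in_tree.add(k)
    return edges

pts = [(0, 0), (1, 0), (0, 1), (5, 5)]
tree = prim_mst(pts)
print(len(tree))  # 3 -- an MST over n points always has n - 1 edges
```

For the 15–20 interest points per sequence used later in the experiments, this naive version is more than fast enough.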
VI. EXPERIMENTAL RESULTS AND ANALYSIS
Here, we give a thorough illustration of our experimental results after intensive testing, both on the view-invariant features separately and on their fusion with motion features.
A. View Invariance
To verify the effectiveness of the cross ratio as a view-invariant feature, we first use the Carnegie Mellon University Motion Capture (MoCap) Database to evaluate its robustness to view change. The MoCap database records 3-D position information captured from sensors on the body of the subject. After projection onto image planes, we can get 2-D trajectories in different views. To make a comparison with the state of the art, our experiment is conducted under the same conditions as [22].

In the projection process, there are 17 synthesized cameras
uniformly distributed around a hemisphere. The distribution of
Fig. 8. Camera position distribution [22].
Fig. 9. Projected trajectories of hand on each viewpoint.
the cameras is depicted in Fig. 8. All the actions are performed around the center of the hemisphere. We project the 3-D data onto the image planes of each viewpoint, with the focal length randomly chosen in the range of 1000 ± 300 mm. Fig. 9 gives an example of the projected trajectories of the hand. It illustrates the action of jumping in each viewpoint, with varying appearance caused by projective distortions.

We select five classes of actions, i.e., climb, jump, run, swing, and walk, from the database for testing. For each action, we get the trajectories of the head, left hand, and left foot of the subject. For every five neighboring points on a trajectory, we compute a pair of CRs by (6) and (7). We transform the CRs of each action into histograms as the view-invariant features of the action.

After projection, we get 200 trajectories for each viewpoint, specifically 12 sequences for climb, 57 sequences for jump, 41 sequences for run, 10 sequences for swing, and 80 sequences for walk. The data are unbalanced; hence, a weighted training strategy is applied in the training process. We use an SVM as the classifier. In SVM training, the radial basis function kernel parameters are chosen by grid search. We train one model for each viewpoint and test it on the other viewpoints. The output for each viewpoint is the class with the highest score.

The performance is shown in Fig. 10. Although the recognition rate for some views is a little lower than that in [22], the average accuracy is about 92.38%, which is much higher than the 81.60% in [22], demonstrating high stability over the 17 viewpoints.

Theoretically, the cross ratios of five coplanar points in general position remain the same under projective transformations. Since we have assumed that the five points used to compute the CRs are approximately coplanar, we evaluate the variance of the CR of a group of five neighboring points at different sampling rates.
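This projective invariance can be checked numerically. The sketch below uses one common pair of five-point invariants formed from ratios of 3 × 3 determinants of homogeneous coordinates (cf. [34]); it is not necessarily the same construction as (6) and (7):

```python
import random

def det3(p, q, r):
    # determinant of the 3x3 matrix whose rows are the homogeneous points
    return (p[0] * (q[1] * r[2] - q[2] * r[1])
            - p[1] * (q[0] * r[2] - q[2] * r[0])
            + p[2] * (q[0] * r[1] - q[1] * r[0]))

def cross_ratio_pair(pts):
    """A pair of projective invariants of five coplanar points in general
    position, built from ratios of 3x3 determinants (one common
    construction; conventions vary across the literature)."""
    p1, p2, p3, p4, p5 = pts
    i1 = (det3(p4, p3, p1) * det3(p5, p2, p1)) / (det3(p4, p2, p1) * det3(p5, p3, p1))
    i2 = (det3(p4, p2, p1) * det3(p5, p3, p2)) / (det3(p4, p3, p2) * det3(p5, p2, p1))
    return i1, i2

def apply_h(H, p):
    # apply a 3x3 projective transformation to a homogeneous point
    return tuple(sum(H[r][c] * p[c] for c in range(3)) for r in range(3))

pts = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0),
       (1.0, 1.0, 1.0), (2.0, 3.0, 1.0)]
random.seed(0)
H = [[random.uniform(0.5, 2.0) for _ in range(3)] for _ in range(3)]
before = cross_ratio_pair(pts)
after = cross_ratio_pair([apply_h(H, p) for p in pts])
print(all(abs(a - b) < 1e-6 for a, b in zip(before, after)))  # True: invariant under H
```

Each ratio contains two determinants in the numerator and two in the denominator, so the common factor det(H)² introduced by the transformation cancels exactly.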
2194 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 21, NO. 4, APRIL 2012
Fig. 10. Recognition rate in different views.
Fig. 11. Mean and variance of CRs in different viewpoint frames.
The variance and mean curves with respect to different sampling rates are shown in Fig. 11.

The mean value of the CR is around 0.6. As shown in Fig. 11, the variance is negligible compared with the mean value when the sampling rate is above 25 Hz; that is, the calculated CR is stable as long as the frame rate is above 25 Hz, indicating that our approximate coplanarity assumption is acceptable under real circumstances.
B. Performance on Public Action Data Sets
The cross ratio has shown high stability and robustness to view change on the motion capture data set. To evaluate its effectiveness on real data, the cross ratio is incorporated with the motion feature by a discriminative model on public data sets to assess the overall performance.

For the single-view case, we test our method on two public action data sets, namely, the Weizmann action data set [39] and the KTH action data set [40]. The Weizmann action data set contains 90 video sequences of 10 natural actions performed by nine different people. For comparison with the state of the art, we selected nine kinds of actions, including running, walking, jumping-jack, jumping-forward-on-two-legs, jumping-in-place-on-two-legs, galloping-sideways, waving-two-hands, waving-one-hand, and bending, similar to what was done in [8]. The KTH action data set contains six kinds of actions, including walking, jogging, running, boxing, hand waving, and hand clapping, with large variations in both appearance and scenes.

The two data sets are each partitioned by subjects into two equal subsets: one subset contains the videos of half of the subjects for training, and the other contains those of the remaining subjects for testing. We then cut all the video sequences in the two data sets into subsequences with 30 frames each, and half of the subsequences are chosen as the training set. Unlike the tracking and background subtraction in [8], we use the STIP method and then compute oriented optical flow histograms in the detected local regions, as described in Sections III and IV. In the learning process, the cost between different hidden nodes of the hCRF is measured by the Euclidean distance in the spatiotemporal domain. As illustrated in Fig. 12, the performance is comparable to the state of the art [8], [41]–[43]. The average recognition rate (89.7%) is better than that in [8] (87.6%) because of the effective feature expression by oriented histogram projection. Although the accuracy of our method is a little lower than that in [42] (91.7%) and [43] (91.8%), further evaluation demonstrates that our method exhibits high flexibility when actions become complex in a more challenging data set.

Fig. 12. Comparison results on the Weizmann and KTH action data sets.

To evaluate the accuracy of different scenarios on the KTH data set, we test our method using four different combinations of training data, as in [40]. The results are given in Fig. 13. s1, s2, s3, and s4 are four different scenarios, i.e., outdoors, outdoors with scale variation, outdoors with different clothes, and indoors, respectively. In Fig. 13, we can see that our method achieves nearly the same results (s1: 82.1%; s1 + s4: 83.2%; s1 + s3 + s4: 85.6%; and s1 + s2 + s3 + s4: 89.7%, all over 80%) in the different scenarios, which shows the robustness of our method.

Fig. 13. Results on different scenarios on the KTH data set.

We also test our method on a large multiview action data
set, i.e., the Institute of Automation, Chinese Academy of Sciences (CASIA) action data set [12]. The CASIA action data set contains sequences of human activities captured outdoors by video cameras from different angles of view. There are 1446 sequences in all, containing eight types of single-person actions (walk, run, bend, jump, crouch, faint, wander, and punching a car), each performed by 24 subjects, and seven types of two-person interactions (rob, fight, follow, follow–gather, meet–part, meet–gather, and overtake), performed by every two subjects. All video sequences were taken simultaneously with three noncalibrated stationary cameras from different angles of view (horizontal, angle, and top-down views).

We selected six kinds of actions in our experiment, including bend, crouch, fall, jump, run, and walk, because these actions can be described and compared better with other methods based on feature descriptors. Similarly, we use the videos of 12 subjects for training, and the videos of the remaining 12 subjects for
testing. We then cut all the video sequences into subsequences with 30 frames for each sequence.

TABLE I
RECOGNITION RESULTS USING SVM FRAME BY FRAME (OVERALL, 74.5%)

TABLE II
RECOGNITION RESULTS USING HMM BASED ON CONTOUR FEATURE (OVERALL, 78.7%)

TABLE III
RECOGNITION RESULTS USING HCRF (OVERALL, 84.2%)

We compared the effectiveness of hCRF with those of SVM
and HMM on the horizontal view. In hCRF modeling, we assume a tree structure of the graph and use the Euclidean distance to measure the cost between different nodes. Every 30 frames from a video are taken as an action sequence, and for each sequence, 15–20 points of interest are selected. The results are shown in Tables I–III.

hCRF achieves the best results compared with SVM and HMM. In the learning process, the sequential modeling methods, i.e., hCRF and HMM, obtain much better accuracy than SVM, at the expense of about twice the training time of SVM; however, their internal structure facilitates more expressive models for complex actions. For example, in recognizing "fall" and "crouch," hCRF and HMM achieve much better results. On the other hand, they are more capable of modeling actions with obscure discriminative boundaries; for example, "run" and "walk" are easily confused without knowing the pace of their execution.

To verify the effectiveness of the fusion of motion features
and view invariants, we tested our method and the state-of-the-art methods [2], [8], [42] on different views in the CASIA multiview database. For each view, we train a model and then test it against the other two views. The results are shown in Table IV.

TABLE IV
RECOGNITION RESULTS OF DIFFERENT VIEWS

As shown in Table IV, the commonly used appearance-based methods [2], [42] give better results under the horizontal view, with accuracy values of 78.8% and 70.5%, respectively; however, they are more vulnerable to view change than the motion-feature-based method [8], which achieves higher average precision in the side view (54.2%) and the top-down view (47.8%) because of its motion features. The simple cross ratio (STIP + CR + hCRF) outputs lower accuracy than motion (STIP + OF + hCRF) because the cross ratio is not stable enough in real scenes. Our method, considering both the optical flow and the cross ratio (STIP + OF + CR + hCRF), obtains the best results in all three views; compared with [8], our method achieves nearly 10% accuracy improvement in the three views.

Although using view invariants alone for action recognition produces a low recognition rate, it does help to maintain robustness to view change in the fusion (STIP + OF + CR + hCRF) because of its inherent invariability among different views, even if the view angle changes extremely. As shown in Table IV, the cross ratio helps to improve precision by 6%–10% after fusion with the optical flow in the hCRF framework, even when the view angle becomes top-down.
VII. CONCLUSION AND DISCUSSION
In this paper, we have proposed a method for view-invariant action recognition that naturally encapsulates motion patterns and view invariants. A feature detection method is used to extract motion information from image sequences, which is much more efficient than traditional background modeling methods. For feature representation, a variety of statistical information is fused to suppress the considerable noise in motion features. When computing invariants across frames, we generalized the cross ratio of four collinear points so that it could be applied to the view-invariant representation of actions. For the time-series modeling, a discriminative probabilistic model, i.e., the hCRF, is applied to model temporal motion patterns and view invariants, by which we can consider motion patterns and view invariants in one framework. Experimental results demonstrate that the proposed method presents excellent discrimination ability in recognizing different actions, with high robustness to view changes in real circumstances.
However, since hidden states are introduced into the expression of the conditional probability, the objective function fails to preserve convexity. Therefore, we can only obtain a locally optimal solution for the hCRF. Moreover, we have to pass and collect messages for each node in the graph during the gradient ascent optimization, which brings considerable computational cost to model training.
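The message passing referred to above can be illustrated with sum-product belief propagation on a small chain (the factors below are toy values; the real model passes messages on the learned tree):

```python
def chain_marginals(unary, pairwise):
    """Sum-product belief propagation on a chain: forward and backward
    message passing, returning per-node marginals (toy factors only)."""
    n, k = len(unary), len(unary[0])
    fwd = [[1.0] * k for _ in range(n)]   # messages arriving from the left
    bwd = [[1.0] * k for _ in range(n)]   # messages arriving from the right
    for j in range(1, n):
        fwd[j] = [sum(fwd[j - 1][a] * unary[j - 1][a] * pairwise[a][b]
                      for a in range(k)) for b in range(k)]
    for j in range(n - 2, -1, -1):
        bwd[j] = [sum(bwd[j + 1][b] * unary[j + 1][b] * pairwise[a][b]
                      for b in range(k)) for a in range(k)]
    marg = []
    for j in range(n):
        belief = [fwd[j][s] * unary[j][s] * bwd[j][s] for s in range(k)]
        z = sum(belief)
        marg.append([b / z for b in belief])
    return marg

unary = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]   # per-node evidence
pairwise = [[0.8, 0.2], [0.2, 0.8]]            # smoothness between neighbors
m = chain_marginals(unary, pairwise)
print(all(abs(sum(row) - 1.0) < 1e-9 for row in m))  # True: marginals sum to 1
```

On a tree, the same two-sweep schedule (leaves to root, root to leaves) yields exact marginals, which is precisely why the tree approximation keeps the expectations in (15) tractable.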
REFERENCES
[1] A. F. Bobick and J. W. Davis, "The recognition of human movement using temporal templates," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 3, pp. 257–267, Mar. 2001.
[2] Y. Wang, K. Huang, and T. Tan, "Human activity recognition based on R transform," in Proc. IEEE CVPR, 2007, pp. 1–8.
[3] R. Souvenir and K. Parrigan, "Viewpoint manifolds for action recognition," J. Image Video Process., vol. 1, pp. 1–13, 2009.
[4] A. J. Lipton, H. Fujiyoshi, and R. S. Patil, "Moving target classification and tracking from real-time video," in Proc. 4th IEEE WACV, 1998, pp. 8–14.
[5] H. Fujiyoshi and A. J. Lipton, "Real-time human motion analysis by image skeletonization," in Proc. 4th IEEE WACV, 1998, pp. 15–21.
[6] F. I. Bashir, A. K. Ashfaq, and S. Dan, "View-invariant motion trajectory-based activity classification and recognition," Multimedia Syst., vol. 12, no. 1, pp. 45–54, Aug. 2006.
[7] M. Ahmad and S.-W. Lee, "Human action recognition using shape and CLG-motion flow from multi-view image sequences," Pattern Recognit., vol. 41, no. 7, pp. 2237–2252, Jul. 2008.
[8] Y. Wang and G. Mori, "Learning a discriminative hidden part model for human action recognition," in Proc. NIPS, 2008, vol. 21, pp. 1721–1728.
[9] A. A. Efros, A. C. Berg, G. Mori, and J. Malik, "Recognizing action at a distance," in Proc. ICCV, Nice, France, 2003, pp. 726–733.
[10] N. Johnson and D. Hogg, "Learning the distribution of object trajectories for event recognition," in Proc. 6th BMVC, 1995, pp. 583–592.
[11] C. Stauffer and W. E. L. Grimson, "Learning patterns of activity using real-time tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 747–757, Aug. 2000.
[12] K. Huang, D. Tao, Y. Yuan, X. Li, and T. Tan, "View independent human behavior analysis," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 39, no. 4, pp. 1028–1035, Aug. 2009.
[13] D. Buzan, S. Sclaroff, and G. Kollios, "Extraction and clustering of motion trajectories in video," in Proc. ICPR, Washington, DC, 2004, pp. 521–524.
[14] W. Hu, D. Xie, and T. Tan, "A hierarchical self-organizing approach for learning the patterns of motion trajectories," IEEE Trans. Neural Netw., vol. 15, no. 1, pp. 135–144, Jan. 2004.
[15] J. Yamato, J. Ohya, and K. Ishii, "Recognizing human action in time-sequential images using hidden Markov model," in Proc. IEEE CVPR, 1992, pp. 379–385.
[16] M. Brand, N. Oliver, and A. Pentland, "Coupled hidden Markov models for complex action recognition," in Proc. CVPR, 1997, pp. 994–999.
[17] C. Sminchisescu, A. Kanaujia, and D. Metaxas, "Conditional models for contextual human motion recognition," Comput. Vis. Image Understanding, vol. 104, no. 2/3, pp. 210–220, Nov./Dec. 2006.
[18] J. D. Lafferty, A. McCallum, and F. C. N. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in Proc. 18th ICML, 2001, pp. 282–289.
[19] M. A. Mendoza and N. Pérez De La Blanca, "Applying space state models in human action recognition: A comparative study," in Proc. 5th Int. Conf. AMDO, 2008, pp. 53–62.
[20] D. Weinland, R. Ronfard, and E. Boyer, "Free viewpoint action recognition using motion history volumes," Comput. Vis. Image Understanding, vol. 104, no. 2/3, pp. 249–257, Nov./Dec. 2006.
[21] D. Weinland, E. Boyer, and R. Ronfard, "Action recognition from arbitrary views using 3D exemplars," in Proc. IEEE ICCV, 2007, pp. 1–7.
[22] Y. Shen and H. Foroosh, "View-invariant action recognition using fundamental ratios," in Proc. IEEE CVPR, 2008, pp. 1–6.
[23] P. Natarajan and R. Nevatia, "View and scale invariant action recognition using multiview shape-flow models," in Proc. IEEE CVPR, 2008, pp. 1–8.
[24] R. Souvenir and J. Babbs, "Learning the viewpoint manifold for action recognition," in Proc. IEEE CVPR, 2008, pp. 1–7.
[25] C. Rao, A. Yilmaz, and M. Shah, "View-invariant representation and recognition of actions," Int. J. Comput. Vis., vol. 50, no. 2, pp. 203–226, Nov. 2002.
[26] V. Parameswaran and R. Chellappa, "View invariants for human action recognition," in Proc. IEEE CVPR, 2003, vol. 2, pp. 613–621.
[27] Y. Zhang, K. Huang, Y. Huang, and T. Tan, "View-invariant action recognition using cross ratios across frames," in Proc. ICIP, 2009, pp. 3549–3552.
[28] D. G. Lowe, "Object recognition from local scale-invariant features," in Proc. ICCV, 2001, pp. 1150–1158.
[29] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in Proc. 14th ICCCN, 2005, pp. 65–72.
[30] I. Laptev and T. Lindeberg, "Space–time interest points," in Proc. ICCV, 2003, pp. 432–439.
[31] J. Y. Bouguet, "Pyramidal implementation of the Lucas–Kanade feature tracker: Description of the algorithm," in Proc. KLT Implementation OpenCV, 2002, pp. 1–9.
[32] B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in Proc. 7th IJCAI, 1981, pp. 674–679.
[33] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE CVPR, 2005, pp. 886–893.
[34] Geometric Invariance in Computer Vision, J. L. Mundy and A. Zisserman, Eds. Cambridge, MA: MIT Press, 1992.
[35] A. Quattoni, M. Collins, and T. Darrell, "Conditional random fields for object recognition," in Proc. Adv. Neural Inf. Process. Syst., 2005, pp. 1097–1104.
[36] S. B. Wang, A. Quattoni, L.-P. Morency, and D. Demirdjian, "Hidden conditional random fields for gesture recognition," in Proc. IEEE CVPR, 2006, pp. 1521–1527.
[37] L. Wang and D. Suter, "Recognizing human activities from silhouettes: Motion subspace and factorial discriminative graphical model," in Proc. IEEE CVPR, 2007, pp. 1–8.
[38] J. Zhang and S. Gong, "Action categorization with modified hidden conditional random field," Pattern Recognit., vol. 43, no. 1, pp. 197–203, Jan. 2010.
[39] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, "Actions as space-time shapes," in Proc. IEEE ICCV, 2005, pp. 1395–1402.
[40] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: A local SVM approach," in Proc. ICPR, 2004, vol. 3, pp. 32–36.
[41] J. C. Niebles and L. Fei-Fei, "A hierarchical model of shape and appearance for human action classification," in Proc. IEEE CVPR, 2007, pp. 1–8.
[42] H. Jhuang, T. Serre, L. Wolf, and T. Poggio, "A biologically inspired system for action recognition," in Proc. ICCV, 2007, pp. 1–8.
[43] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, "Learning realistic human actions from movies," in Proc. IEEE CVPR, 2008, pp. 1–8.
[44] L.-P. Morency, A. Quattoni, and T. Darrell, "Latent-dynamic discriminative models for continuous gesture recognition," in Proc. IEEE CVPR, 2007, pp. 1–8.
[45] S. Wang, A. Quattoni, L.-P. Morency, D. Demirdjian, and T. Darrell, "Hidden conditional random fields for gesture recognition," in Proc. IEEE CVPR, 2006, pp. 1521–1527.
[46] M. S. Nixon and A. S. Aguado, Feature Extraction and Image Processing for Computer Vision. New York: Academic, 2008.
Kaiqi Huang (M’05–S’09–SM’09) received theM.S. degree in electrical engineering from NanjingUniversity of Science and Technology, Nanjing,China, and the Ph.D. degree in signal and informa-tion processing from Southeast University, Nanjing.After receiving the Ph.D. degree, he became a
Postdoctoral Researcher with the National Labora-tory of Pattern Recognition, Institute of Automation,Chinese Academy of Sciences, Beijing, China,where he is currently an Associate Professor. Hehas published more than 80 papers on TPAMI, TIP,
TCSVT, TSMCB, CVIU, Pattern Recognition and CVPR, and ECCV. Hisinterests include visual surveillance, image and video analysis, human visionand cognition, computer vision, etc.Dr. Huang is a Program Committee Member of more than 50 international
conferences and workshops and is a board member of the IEEE Systems, Man,and Cybernetics Technical Committee on Cognitive Computing. He is theDeputy Secretary-General of the IEEE Beijing Section.
Yeying Zhang received the B.Sc. degree in electrical engineering in video processing and multimedia communication from Chengdu University, Chengdu, China, in 2008. He is currently working toward the Master's degree in pattern recognition and intelligent systems in the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China.

His current research interests include computer vision, pattern recognition, human behavior analysis, etc.
Tieniu Tan (F’03) received the B.Sc. degree in elec-tronic engineering from Xi’an Jiaotong University,Xi’an, China, in 1984 and the M.Sc. and Ph.D. de-grees in electronic engineering from Imperial Col-lege of Science, Technology, and Medicine, London,U.K., in 1986 and 1989, respectively.In October 1989, he was with the Computational
Vision Group, Department of Computer Science,The University of Reading, Berkshire, U.K., wherehe was a Research Fellow, Senior Research Fellow,and Lecturer. In January 1998, he returned to China
to join the National Laboratory of Pattern Recognition (NLPR), Instituteof Automation, Chinese Academy of Sciences, Beijing, China, where he iscurrently a Professor and the Director of the NLPR. He has published morethan 200 research papers in refereed journals and conferences in the areasof image processing, computer vision, and pattern recognition. His currentresearch interests include image processing, machine and computer vision,pattern recognition, multimedia, and robotics.Dr. Tan was a Guest Editor of the International Journal of Computer Vision
(June 2000) and is an Associate Editor or member of the Editorial Board of eightinternational journals, including the TPAMI, TCSVT, and Pattern Recognition.He serves as Referee or Program Committee Member and Chair for many majornational and international journals and conferences. He is the Chair of the IAPRTechnical Committee on Signal Processing for Machine Intelligence and theChair of the Fellow Committee of the IEEE Beijing Section.