View-Invariant Action Recognition Based on Artificial Neural Networks

IEEE Transactions on Neural Networks and Learning Systems, 2012, Volume 23, Issue 3

Alexandros Iosifidis, Anastasios Tefas, Member, IEEE, and Ioannis Pitas, Fellow, IEEE

A. Iosifidis, A. Tefas and I. Pitas are with the Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki 54124, Greece. e-mail: {aiosif,tefas,pitas}@aiia.csd.auth.gr.

Abstract—In this paper, a novel view-invariant action recognition method based on neural network representation and recognition is proposed. The novel representation of action videos is based on learning spatially related human body posture prototypes using Self Organizing Maps (SOM). Fuzzy distances from human body posture prototypes are used to produce a time-invariant action representation. Multilayer perceptrons are used for action classification. The algorithm is trained using data from a multi-camera setup. An arbitrary number of cameras can be used in order to recognize actions using a Bayesian framework. The proposed method can also be applied to videos depicting interactions between humans, without any modification. The use of information captured from different viewing angles leads to high classification performance. The proposed method is the first one that has been tested in challenging experimental setups, a fact that denotes its effectiveness in dealing with most of the open issues in action recognition.

Index Terms—Human action recognition, Fuzzy Vector Quantization, Multi-layer Perceptrons, Bayesian Frameworks.

I. INTRODUCTION

Human action recognition is an active research field, due to its importance in a wide range of applications, such as intelligent surveillance [1], human-computer interaction [2], content-based video compression and retrieval [3], augmented reality [4], etc. The term action is often confused with the terms activity and movement. An action (sometimes also called a movement) refers to a single period of a human motion pattern, such as a walking step. Activities consist of a number of actions/movements, i.e., dancing consists of successive repetitions of several actions, e.g., walk, jump, wave hand, etc. Actions are usually described by using either features based on motion information and optical flow [5], [6], or features devised mainly for action representation [7], [8]. Although the use of such features leads to satisfactory action recognition results, their computation is expensive. Thus, in order to perform action recognition at high frame rates, the use of simpler action representations is required. Neurobiological studies [9] have concluded that the human brain can perceive actions by observing only the human body poses (postures) during action execution. Thus, actions can be described as sequences of consecutive human body poses, in terms of human body silhouettes [10], [11], [12].

After describing actions, action classes are usually learned by training pattern recognition algorithms, such as Artificial Neural Networks (ANNs) [13], [14], [15], Support Vector Machines (SVMs) [16], [17] and discriminant dimensionality reduction techniques [18]. In most applications, the camera viewing angle is not fixed and human actions are observed from arbitrary camera viewpoints. Several researchers have highlighted the significant impact of camera viewing angle variations on action recognition performance [19], [20]. This is the so-called viewing angle effect. To provide view-independent methods, the use of multi-camera setups has been adopted [21], [22], [23]. By observing the human body from different viewing angles, a view-invariant action representation is obtained. This representation is subsequently used to describe and recognize actions.

Although multi-view methods address the viewing angle effect properly, they set a restrictive recognition setup, which is difficult to meet in real systems [24]. Specifically, they assume the same camera setup in both the training and recognition phases. Furthermore, the human under consideration must be visible from all synchronized cameras. However, an action recognition method should not be based on such assumptions, as several issues may arise in the recognition phase. Humans inside a scene may be visible from an arbitrary number of cameras and may be captured from an arbitrary viewing angle. Inspired by this setting, a novel approach to view-independent action recognition is proposed. Aiming to solve the generic action recognition problem, a novel view-invariant action recognition method based on ANNs is proposed in this paper. The proposed approach does not require the use of the same number of cameras in the training and recognition phases. An action captured by an arbitrary number N of cameras is described by a number of successive human body postures. The similarity of every human body posture to body posture prototypes, determined in the training phase by a self-organizing neural network, is used to provide a time-invariant action representation. Action recognition is performed for each of the N cameras by using a Multi-Layer Perceptron (MLP), i.e., a feed-forward neural network. Action recognition results are subsequently combined to recognize the unknown action. The proposed method performs view-independent action recognition using an uncalibrated multi-camera setup. The combination of the recognition outcomes that correspond to different viewing angles leads to action recognition with high recognition accuracy.

The main contributions of this paper are: a) the use of Self Organizing Maps (SOM) for identifying the basic posture prototypes of all the actions, b) the use of cumulative fuzzy distances from the SOM in order to achieve time-invariant action representations, c) the use of a Bayesian framework to combine the recognition results produced for each camera, and d) the solution of the camera viewing angle identification problem using combined neural networks.

The remainder of this paper is structured as follows.


An overview of the recognition framework used in the proposed approach and a small discussion concerning the action recognition task are given in Section I-A. Section II presents details of the processing steps performed in the proposed method. Experiments assessing the performance of the proposed method are described in Section III. Finally, conclusions are drawn in Section IV.

    A. Problem Statement

Let an arbitrary number N_C of cameras capture a scene at a given time instance. These cameras form an N_C-view camera setup. This camera setup can be a converging one or not. In the first case, the space which can be captured by all N_C cameras is referred to as the capture volume. In the latter case, the cameras forming the camera setup are placed in such positions that there is no space which can be simultaneously captured by all the cameras. A converging and a non-converging camera setup are illustrated in Figure 1. N_t video frames from a specific camera, f_i, i = 1, ..., N_t, form a single-view video f = [f_1^T, f_2^T, ..., f_{N_t}^T]^T.


    Fig. 1. a) A converging and b) a non-converging camera setup.

Actions can be periodic (e.g., walk, run) or not (e.g., bend, sit). The term elementary action refers to a single human action pattern. In the case of periodic actions, the term elementary action refers to a single period of the motion pattern, such as a walk step. In the case of non-periodic actions, the term elementary action refers to the whole motion pattern, i.e., a bend sequence. Let A be a set of N_A elementary action classes {a_1, ..., a_{N_A}}, such as walk, jump, run, etc. Let a person perform an elementary action a_j, 1 ≤ j ≤ N_A, captured by N ≤ N_C cameras. This results in the creation of N single-view action videos f_i = [f_{i1}^T, f_{i2}^T, ..., f_{iN_{tj}}^T]^T, i = 1, ..., N, each depicting the action from a different viewing angle. Since different elementary actions have different durations, the number of video frames N_{tj}, j = 1, ..., N_A, constituting elementary action videos of different action classes varies. For example, a run period consists of only 10 video frames, whereas a sit sequence consists of 40 video frames on average at 25 fps. The action recognition task is the classification of one or more action videos, depicting an unknown person performing an elementary action, in one of the known action classes specified by the action class set A.

In the following, we present the main challenges for an action recognition method:

• The person can be seen from N ≤ N_C cameras. The case N < N_C can occur either when the camera setup used is not a converging one, or when, using a converging camera setup, the person performs the action outside the capture volume, or in the case of occlusions.

• During the action execution, the person may change movement direction. This affects the viewing angle from which he/she is captured by each camera. The identification of the camera position with respect to the person's body is referred to as the camera viewing angle identification problem and needs to be solved in order to perform view-independent action recognition.

• Each camera may capture the person from an arbitrary distance. This affects the size of the human body projection in each camera plane.

• Elementary actions differ in duration. This is observed in different realizations of the same action performed by different persons, or even by the same person at different times or under different circumstances.

• The method should allow continuous action recognition over time.

• Elementary action classes highly overlap in the video frame space, i.e., the same body postures may appear in different actions. For example, many postures of the classes jump in place and jump forward are identical. Furthermore, variations in style can be observed between two different realizations of the same action, performed either by the same person or by different persons. Considering these observations, the body postures of a person performing one action may be the same as the body postures of another person performing a different action. Moreover, there are certain body postures that uniquely characterize certain action classes. An action representation should take all these observations into account in order to lead to a good action recognition method.

• Cameras forming the camera setup may differ in resolution and frame rate, and synchronization errors may occur in real camera setups, that is, there might be a delay between the video frames produced by different cameras.

• The use of multi-camera setups involves the need of camera calibration for a specific camera setting. The action recognition algorithms need to be retrained even for small variations of the camera positions.

The objective is the combination of all available information coming from all the available cameras depicting the person under consideration, in order to recognize the unknown action. The proposed method copes with all the above-mentioned issues and constraints. To the best of the authors' knowledge, this is the first time that an action recognition method is tested against all these scenarios with very good performance.

II. PROPOSED METHOD

In this section, each step of the proposed method is described in detail. The extraction of posture data, used as input data in the remaining steps, is presented in subsection II-A. The use of a Self Organizing Map (SOM) to determine human body posture prototypes is described in subsection II-B. Action representation and classification are presented in subsections II-C and II-D, respectively. A variation of the original algorithm, which exploits the observations' viewing angle information, is presented in subsections II-E and II-F. Finally, subsection II-G presents the procedure followed in the recognition phase.

A. Preprocessing Phase

As previously described, an elementary action is captured by N cameras in elementary action videos consisting of N_{tj}, 1 ≤ j ≤ N_A, video frames that depict one action period. The number N_{tj} may vary over action classes, as well as over elementary action videos coming from the same action class. Multi-period action videos are manually split into elementary action videos, which are subsequently used for training and testing in the elementary action recognition case. In the case of videos showing many action periods (continuous action recognition), a sliding window of possibly overlapping video segments having suitably chosen length N_{tw} is used and recognition is performed at each window position, as will be described in Section III-D. In the following, the term action video will refer to an elementary action video.

Moving object segmentation techniques [25], [26] are applied to each action video frame to create binary images depicting the person's body in white and the background in black. These images are centered at the person's center of mass. Bounding boxes of size equal to the maximum bounding box enclosing the person's body are extracted and rescaled to N_H × N_W pixels to produce binary posture images of fixed size. Eight binary posture images of eight actions (walk, run, jump in place, jump forward, bend, sit, fall and wave one hand), taken from various viewing angles, are shown in Figure 2.

    Fig. 2. Posture images of eight actions taken from various viewing angles.

Binary posture images are represented as matrices and these matrices are vectorized to produce posture vectors p ∈ R^D, D = N_H · N_W. That is, each posture image is finally represented by a posture vector p. Thus, every action video consisting of N_{tj} video frames is represented by a set of posture vectors p_i ∈ R^D, i = 1, ..., N_{tj}. In the experiments presented in this paper, the values N_H = 64, N_W = 64 have been used and binary posture images were scanned column-wise.
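A minimal sketch of this preprocessing step in Python (NumPy/OpenCV), assuming a binary segmentation mask and a precomputed maximum bounding box size, is given below; the function and parameter names are illustrative and not part of the original implementation:

    import cv2
    import numpy as np

    N_H, N_W = 64, 64  # fixed posture image size used in the paper

    def posture_vector(mask, box_h, box_w):
        """Turn a binary foreground mask (person = 1) into a posture vector p.

        box_h, box_w: size of the maximum bounding box enclosing the person's
        body over the video (assumed to be computed beforehand).
        """
        # center of mass of the binary silhouette
        m = cv2.moments(mask.astype(np.uint8), True)
        cy, cx = int(m["m01"] / m["m00"]), int(m["m10"] / m["m00"])

        # crop a box of the maximum bounding-box size, centered at the center of mass
        half_h, half_w = box_h // 2, box_w // 2
        padded = np.pad(mask, ((half_h, half_h), (half_w, half_w)))
        crop = padded[cy:cy + box_h, cx:cx + box_w]

        # rescale to N_H x N_W and vectorize column-wise (Fortran order)
        resized = cv2.resize(crop.astype(np.float32), (N_W, N_H),
                             interpolation=cv2.INTER_NEAREST)
        return (resized > 0).astype(np.float32).flatten(order="F")  # p in R^D, D = N_H*N_W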

B. Posture Prototypes Identification

In the training phase, posture vectors p_i, i = 1, ..., N_p, N_p being the total number of posture vectors constituting all the N_T training action videos (having N_{tj} video frames each), are used to produce action-independent posture prototypes without exploiting the known action labels. To produce spatially related posture prototypes, a SOM is used [27]. The use of the SOM leads to a topographic map (lattice) of the input data, in which the spatial locations of the resulting prototypes in the lattice are indicative of intrinsic statistical features of the input postures. The training procedure for constructing the SOM is based on three procedures:

1) Competition: For each of the training posture vectors p_i, its Euclidean distance from every SOM weight w_{Sj} ∈ R^D, j = 1, ..., N_S, is calculated. The winning neuron j* is the one that gives the smallest distance:

j* = argmin_j || p_i - w_{Sj} ||_2.   (1)

2) Cooperation: The winning neuron j* indicates the center of a topological neighborhood h_{j*}. Neurons are excited depending on their lateral distance from this neuron. A typical choice of h_{j*} is the Gaussian function:

h_{j*k}(n) = exp( -r_{j*k}^2 / (2 σ^2(n)) ),   (2)

where k corresponds to the neuron at hand, n is the iteration of the algorithm, r_{j*k} is the Euclidean distance between neurons j* and k in the lattice space and σ is the effective width of the topological neighborhood. σ is a function of n: σ(n) = σ(0) exp(-n / N_0), where N_0 is the total number of training iterations. σ(0) = (l_w + l_h) / 4 in our experiments, l_w and l_h being the lattice width and height, respectively.

3) Adaptation: At this step, each neuron is adapted with respect to its lateral distance from the winning neuron as follows:

w_{Sk}(n+1) = w_{Sk}(n) + η(n) h_{j*k}(n) (p_i - w_{Sk}(n)),   (3)

where η(n) is the learning-rate parameter: η(n) = η(0) exp(-n / N_0). η(0) = 0.1 in our experiments.

The optimal number of update iterations is determined by performing a comparative study on the produced lattices. In a preliminary study, we have conducted experiments using a variety of iteration numbers for the update procedure. Specifically, we trained the algorithm using 20, 50, 60, 80, 100 and 150 update iterations. Comparing the produced lattices, we found that the quality of the produced posture prototypes does not change for update iteration numbers greater than 60. The optimal lattice topology is determined using the cross-validation procedure, which determines the ability of a learning algorithm to generalize over data it was not trained on. That is, the learning algorithm is trained using all but some training data, which are subsequently used for testing. This procedure is applied multiple times (folds). The test action videos used to determine the optimal lattice topology were all the action videos of a specific person not included in the training set.
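A compact sketch of the three-step training loop described above (competition, cooperation, adaptation) is given below in Python/NumPy; the data layout and initialization are illustrative assumptions, not the authors' code:

    import numpy as np

    def train_som(P, lw=12, lh=12, n_iters=100, eta0=0.1, seed=0):
        """Train a SOM on posture vectors P (shape: Np x D); returns NS x D weights."""
        rng = np.random.default_rng(seed)
        NS = lw * lh
        W = P[rng.integers(0, len(P), NS)].copy()        # init weights from random postures
        grid = np.array([(x, y) for y in range(lh) for x in range(lw)], dtype=float)
        sigma0 = (lw + lh) / 4.0                         # sigma(0) as in the paper

        for n in range(n_iters):
            sigma = sigma0 * np.exp(-n / n_iters)        # shrinking neighborhood width
            eta = eta0 * np.exp(-n / n_iters)            # decaying learning rate eta(n)
            for p in P[rng.permutation(len(P))]:
                # 1) Competition: winner is the closest weight (Eq. 1)
                j_star = np.argmin(np.linalg.norm(W - p, axis=1))
                # 2) Cooperation: Gaussian neighborhood around the winner (Eq. 2)
                r2 = np.sum((grid - grid[j_star]) ** 2, axis=1)
                h = np.exp(-r2 / (2.0 * sigma ** 2))
                # 3) Adaptation: move all weights towards the posture vector (Eq. 3)
                W += eta * h[:, None] * (p - W)
        return W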

A 12 × 12 lattice of posture prototypes, produced using action videos of the action classes walk, run, jump in place, jump forward, bend, sit, fall and wave one hand captured from eight viewing angles (0°, 45°, 90°, 135°, 180°, 225°, 270° and 315° with respect to the person's body), is depicted in Figure 3. As can be seen, some posture prototypes correspond to body postures that appear in more than one action. For example, posture prototypes in lattice locations (1,f), (1,g), (7,d) describe postures of actions jump in place, jump forward and sit, while posture prototypes in lattice locations (3,k), (6,l), (8,l) describe postures of actions walk and run. Moreover, some posture prototypes correspond to postures that appear in only one action class. For example, posture prototypes in lattice locations (1,i), (10,i), (12,e) describe postures of action bend, while posture prototypes in lattice locations (4,f), (3,i), (5,j) describe postures of action wave one hand. Furthermore, one can notice that similar posture prototypes lie in adjacent lattice positions. This results in a better posture prototype organization.

Fig. 3. A 12 × 12 SOM produced by posture frames of eight actions captured from eight viewing angles.

To illustrate the advantage given by the SOM posture prototype representation in the action recognition task, Figure 4 presents the winning neurons in the training set used to produce the 12 × 12 lattice of Figure 3 for each of the action classes. In this Figure, only the winning neurons are shown, and the grayscale value of the enclosing square is a function of their number of wins. That is, after determining the SOM, the similarity between all the posture vectors belonging to the training action videos and the SOM neurons was computed, and the winning neuron corresponding to each posture vector was determined. The grayscale value of the enclosing squares is high for neurons having a large number of wins and low for those having a small one. Thus, neurons enclosed in squares with a high grayscale value correspond to human body poses appearing more often in each action type. As can be seen, the posture prototypes representing each action are quite different and concentrate in neighboring parts of the lattice. Thus, one can expect that the more non-overlapping these maps are, the more discriminant the representation they offer for action recognition.


    Fig. 4. Winning neurons for eight actions: a) walk, b) run, c) jump in place,d) jump forward, e) bend, f) sit, g) fall and h) wave one hand.

By observing the lattice shown in Figure 3, it can be seen that the spatial organization of the posture prototypes defines areas where posture prototypes correspond to different viewing angles. To illustrate this, Figure 5 presents the winning neurons for all action videos that correspond to a specific viewing angle. It can be seen that the winning neurons corresponding to different views are quite distinguished. Thus, the same representation can also be used for viewing angle identification, since the maps that correspond to different views are quite non-overlapping. Overall, the SOM posture prototype representation has enough discriminant power to provide a good representation space for the action posture prototype vectors.


Fig. 5. Winning neurons for eight views: a) 0°, b) 45°, c) 90°, d) 135°, e) 180°, f) 225°, g) 270° and h) 315°.

    C. Action Representation

Let posture vectors p_i, i = 1, ..., N_{tj}, j = 1, ..., N_A, constitute an action video. Fuzzy distances of every p_i to all the SOM weights w_{Sk}, k = 1, ..., N_S, are calculated to determine the similarity of every posture vector with every posture prototype:

d_{ik} = ( || p_i - w_{Sk} ||_2 )^( -2/(m-1) ),   (4)

where m is the fuzzification parameter (m > 1). Its optimal value is determined by applying the cross-validation procedure. We have experimentally found that a value of m = 1.1 provides a satisfactory action representation and, thus, this value is used in all the experiments presented in this paper. Fuzzy distances allow for a smooth distance representation between posture vectors and posture prototypes.

After the calculation of the fuzzy distances, each posture vector is mapped to the distance vector d_i = [d_{i1}, d_{i2}, ..., d_{iN_S}]^T. Distance vectors d_i, i = 1, ..., N_{tj}, are normalized to produce membership vectors u_i = d_i / Σ_{k=1}^{N_S} d_{ik}, u_i ∈ R^{N_S}, which correspond to the final representations of the posture vectors in the SOM posture space. The mean vector s = (1/N_{tj}) Σ_{i=1}^{N_{tj}} u_i, s ∈ R^{N_S}, of all the N_{tj} membership vectors comprising the action video is called the action vector and represents the action video.

The use of the mean vector leads to a duration-invariant action representation. That is, we expect the normalized cumulative membership of a specific action to be invariant to the duration of the action. This expectation is enhanced by the observation discussed in Subsection II-B and illustrated in Figure 4. Given that the winning SOM neurons corresponding to different actions are quite distinguished, we expect that the distribution of fuzzy memberships to the SOM neurons will characterize actions. Finally, the action vectors representing all N_T training action videos, s_j, j = 1, ..., N_T, are normalized to have zero mean and unit standard deviation. In the test phase, all the N action vectors s_k, k = 1, ..., N, that correspond to N test action videos depicting the person from different viewing angles are normalized accordingly.
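The mapping from the posture vectors of one action video to its action vector can be summarized as in the following sketch; the fuzzification parameter m = 1.1 follows the paper, while the helper name and data layout are assumptions:

    import numpy as np

    def action_vector(postures, W, m=1.1):
        """postures: Ntj x D posture vectors of one action video; W: NS x D SOM weights."""
        # fuzzy distances d_ik (Eq. 4): a small Euclidean distance gives a large value
        dist = np.linalg.norm(postures[:, None, :] - W[None, :, :], axis=2)  # Ntj x NS
        dist = np.maximum(dist, 1e-12)                 # avoid division by zero
        d = dist ** (-2.0 / (m - 1.0))
        # membership vectors u_i: normalize each row so that its entries sum to one
        u = d / d.sum(axis=1, keepdims=True)
        # action vector s: mean membership over all video frames (duration invariant)
        return u.mean(axis=0)

Training action vectors are subsequently standardized (zero mean, unit standard deviation per dimension), and the same statistics are applied to the test action vectors, as stated above.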

    D. Single-view Action Classification

As previously described, action recognition performs the classification of an unknown incoming action, captured by N ≤ N_C action videos, in one of the N_A known action classes a_j, j = 1, ..., N_A, contained in an action class set A. Using the SOM posture prototype representation, which leads to spatially related posture prototypes, and expecting that action videos of every action class will be described by spatially related posture vectors, a MLP is proposed for the action classification task, consisting of N_S inputs (equal to the dimensionality of the action vectors s), N_A outputs (each corresponding to an action class a_j, j = 1, ..., N_A) and using the hyperbolic tangent activation function f_sigmoid(x) = a tanh(bx), where the values a = 1.7159 and b = 2/3 were chosen [28], [29].

In the training phase, all N_T training action vectors s_i, i = 1, ..., N_T, accompanied by their action labels, are used to determine the MLP weights W_A using the Backpropagation algorithm [30]. The target outputs corresponding to each action vector, o_i = [o_{i1}, ..., o_{iN_A}]^T, are set to o_{ik} = 0.95 for action vectors belonging to action class k and o_{ik} = -0.95 otherwise, k = 1, ..., N_A. For each of the action vectors s_i, the MLP response ô_i = [ô_{i1}, ..., ô_{iN_A}]^T is calculated by:

ô_{ik} = f_sigmoid(s_i^T w_{Ak}),   (5)

where w_{Ak} is a vector that contains the MLP weights corresponding to output k.

The training procedure is performed in an on-line form, i.e., adjustments of the MLP weights are performed for each training action vector. After feeding a training action vector s_i and calculating the MLP response ô_i, the modification of the weight that connects neurons i and j follows the update rule:

W_{Aji}(n+1) = c W_{Aji}(n) + η δ_j(n) y_i(n),   (6)

where δ_j(n) is the local gradient of the j-th neuron, y_i is the output of the i-th neuron, η is the learning rate, c is a positive number called the momentum constant (η = 0.05 and c = 0.1 in the experiments presented in this paper) and n is the iteration number. Action vectors are introduced to the MLP in a random sequence. This procedure is applied until the Mean Square Error (MSE) falls below an acceptable error rate ε:

E = [ (1/N_T) Σ_{i=1}^{N_T} ( ô_i - o_i )^2 ]^(1/2) < ε.   (7)

The optimal MSE parameter value is determined by performing the cross-validation procedure using different threshold values ε for the mean square error and different numbers of iterations N_{bp} of the algorithm. We used values of ε equal to 0.1, 0.01 and 0.001 and values of N_{bp} equal to 100, 500 and 1000. We found the best combination to be ε = 0.01 and N_{bp} = 1000 and used it in all our experiments.
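For illustration, a minimal sketch of the above online training is given below for a single-layer tanh classifier, which is the simplest reading of Eq. (5); standard momentum on the weight increments is used here, and the initialization and all names are assumptions rather than the authors' implementation:

    import numpy as np

    A_, B_ = 1.7159, 2.0 / 3.0                           # tanh activation constants

    def train_single_layer(S, y, NA, eta=0.05, c=0.1, eps=0.01, max_epochs=1000, seed=0):
        """Online training of a one-layer tanh classifier on action vectors S (NT x NS).

        y: integer class labels; targets are +0.95 for the true class, -0.95 otherwise.
        """
        rng = np.random.default_rng(seed)
        W = rng.standard_normal((S.shape[1], NA)) * 0.01
        dW = np.zeros_like(W)
        O = np.where(np.eye(NA)[y] == 1, 0.95, -0.95)    # target outputs o_i
        for _ in range(max_epochs):
            for i in rng.permutation(len(S)):
                net = S[i] @ W
                out = A_ * np.tanh(B_ * net)             # Eq. (5)
                # local gradients for the output neurons
                delta = (O[i] - out) * A_ * B_ * (1.0 - np.tanh(B_ * net) ** 2)
                dW = c * dW + eta * np.outer(S[i], delta)  # momentum on the increments
                W += dW
            # stop when the RMS error over the training set drops below eps (Eq. 7)
            E = np.sqrt(np.mean((A_ * np.tanh(B_ * (S @ W)) - O) ** 2))
            if E < eps:
                break
        return W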

In the test phase, a set S of action vectors s_i, i = 1, ..., N, corresponding to the N action videos captured by all the N available cameras depicting the person, is obtained. To classify S in one of the N_A action classes a_j specified by the action set A, each of the N action vectors s_i is fed to the MLP and N responses are obtained:

ô_i = [ô_{i1}, ..., ô_{iN_A}]^T, i = 1, ..., N.   (8)

Each action video is classified to the action class a_j, j = 1, ..., N_A, that corresponds to the maximum MLP output:

â_i = argmax_j ô_{ij}, i = 1, ..., N, j = 1, ..., N_A.   (9)

Thus, a vector â = [â_1, ..., â_N]^T ∈ R^N containing all the recognized action classes is obtained. Finally, expecting that most of the recognized actions â_i will correspond to the actual action class of S, S is classified to an action class by performing majority voting over the action classes indicated in â.
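Assuming a trained classifier in place of the action MLP (mlp_forward below is a placeholder returning the N_A responses of Eq. (8)), the per-camera decisions of Eq. (9) and the majority vote can be sketched as follows:

    import numpy as np
    from collections import Counter

    def classify_action_set(S, mlp_forward):
        """S: N x NS action vectors (one per camera); mlp_forward: s -> NA responses."""
        # Eq. (8)-(9): per-camera responses and per-camera action labels a_hat_i
        responses = np.stack([mlp_forward(s) for s in S])          # N x NA
        per_camera = responses.argmax(axis=1)
        # majority vote over the N single-view decisions
        return Counter(per_camera.tolist()).most_common(1)[0][0]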

Using this approach, view-independent action recognition is achieved. Furthermore, as the number N of action vectors forming S may vary, a generic multi-view action recognition method is obtained. In the above described procedure, no viewing angle information is used in the combination of the classification results â_i, i = 1, ..., N, corresponding to each of the N cameras, to produce the final recognition result. As noted before, actions appear quite different when they are captured from different viewing angles. Thus, some views may be more discriminant for certain actions. For example, actions walk and run are well distinguished when they are captured from a side view, but they seem similar when they are captured from the front view. In addition, actions wave one hand and jump in place are well distinguished when they are captured from the front or back view, but not from the side views. Therefore, instead of majority voting, a more sophisticated method can be used to combine all the available information and produce the final action recognition result by exploiting the viewing angle information. In the following, a procedure based on a Bayesian framework is proposed in order to combine the action recognition results from all N cameras.

E. Combination of single-view action classification results using a Bayesian framework

The classification of an action vector set S, consisting of N ≤ N_C action vectors s_i, i = 1, ..., N, each corresponding to an action video coming from a specific camera used for recognition, in one of the action classes a_j, j = 1, ..., N_A, of the action class set A, can be performed using a probabilistic framework. Each of the N action vectors s_i of S is fed to the MLP and N vectors containing the responses ô_i = [ô_{i1}, ô_{i2}, ..., ô_{iN_A}]^T are obtained. The problem to be solved is to classify S in one of the action classes a_j given these observations, i.e., to estimate the probability P(a_j | ô_1^T, ô_2^T, ..., ô_{N_C}^T) of every action class given the MLP responses. In the case where N < N_C, the N_C - N missing MLP outputs ô_i will be set to zero, as no recognition result is provided for these cameras. Since the MLP responses are real valued, the estimation of P(a_j | ô_1^T, ô_2^T, ..., ô_{N_C}^T) is very difficult. Let â_i denote the action recognition result corresponding to the test action vector s_i representing the action video captured by the i-th camera, taking values in the action class set A. Without loss of generality, s_i, i = 1, ..., N, is assumed to be classified to the action class that provides the highest MLP response, i.e., â_i = argmax_j ô_{ij}. That is, the problem to be solved is to estimate the probabilities P(a_j | â_1, â_2, ..., â_{N_C}) of every action class a_j, given the discrete variables â_i, i = 1, ..., N_C. Let P(a_j) denote the a priori probability of action class a_j and P(â_i) the a priori probability of recognizing â_i from camera i. Let P(â_1, â_2, ..., â_{N_C}) be the joint probability of all the N_C cameras observing one of the N_A action classes. Furthermore, the conditional probabilities P(â_1, â_2, ..., â_{N_C} | a_j) that camera 1 recognizes action class â_1, camera 2 recognizes action class â_2, etc., given that the actual action class of S is a_j, can be calculated. Using these probabilities, the probability P(a_j | â_1, â_2, ..., â_{N_C}) of action class a_j, j = 1, ..., N_A, given the classification results â_i, can be estimated using the Bayes formula:

P(a_j | â_1, â_2, ..., â_{N_C}) = P(â_1, â_2, ..., â_{N_C} | a_j) P(a_j) / Σ_{l=1}^{N_A} P(â_1, â_2, ..., â_{N_C} | a_l) P(a_l).   (10)

In the case of equiprobable action classes, P(a_j) = 1/N_A. If this is not the case, the P(a_j) should be set to their real values and the training data should be chosen accordingly. Expecting that the training and evaluation data come from the same distributions, P(â_1, â_2, ..., â_{N_C} | a_j) can be estimated during the training procedure. In the evaluation phase, S can be classified to the action class providing the maximum conditional probability, i.e., â_S = argmax_j P(a_j | â_1, â_2, ..., â_{N_C}). However, such a system cannot be applied in a straightforward manner, since the number of combinations of the N_C cameras providing N_A + 1 action recognition results is enormous. The case N_A + 1 refers to the situation where a camera does not provide an action recognition result, because the person is not visible from this camera. For example, in the case of N_C = 8 and N_A = 8, the number of all possible combinations is equal to (N_A + 1)^{N_C} = 43046721. Thus, in order to estimate the probabilities P(â_1, â_2, ..., â_{N_C} | a_j), an enormous training data set should be used.

To overcome this difficulty, the action classification task

could be applied to each of the N available cameras independently and the N classification results could subsequently be combined to classify the action vector set S to one of the action classes a_j. That is, for camera i, the probability P(a_j | â_i) of action class a_j given the recognized action class â_i can be estimated using the Bayes formula:

P(a_j | â_i) = P(â_i | a_j) P(a_j) / Σ_{l=1}^{N_A} P(â_i | a_l) P(a_l).   (11)

As previously described, since the person can move freely, the viewing angle can vary for each camera, i.e., if a camera captures the person from the front viewing angle at a given time instance, a change in his/her motion direction may result in this camera capturing him/her from a side viewing angle at a subsequent time instance. Since the viewing angle has proven to be very important in the action recognition task, the viewing angle information should be exploited to improve the action recognition accuracy. Let P(â_i, v̂_i) denote the joint probability that camera i recognizes action class â_i captured from viewing angle v̂_i. Using P(â_i, v̂_i), the probability P(a_j | â_i, v̂_i) of action class a_j, given â_i and v̂_i, can be estimated by:

P(a_j | â_i, v̂_i) = P(â_i, v̂_i | a_j) P(a_j) / Σ_{l=1}^{N_A} P(â_i, v̂_i | a_l) P(a_l).   (12)

The conditional probabilities P(a_j | â_i, v̂_i) are estimated in the training phase. After training the MLP, the action vectors corresponding to the training action videos are fed to the MLP in order to obtain its responses. Each action vector is classified to the action class providing the highest MLP response. Exploiting the action and viewing angle labels accompanying the training action videos, the conditional probabilities P(â_i, v̂_i | a_j) corresponding to the training set are calculated. Finally, the P(a_j | â_i, v̂_i) are obtained using Equation (12). In the recognition phase, each camera provides an action recognition result. The viewing angle corresponding to each camera should also be estimated automatically to obtain v̂_i. A procedure to this end, which exploits the vicinity property of the SOM posture prototype representation, is presented in the next subsection. After obtaining the â_i and v̂_i, i = 1, ..., N, the action vector set S is classified to the action class providing the maximum probability sum [31], i.e., â_S = argmax_j Σ_{i=1}^{N} P(a_j | â_i, v̂_i).
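The combination step can be sketched as follows, with the conditional probabilities P(â, v̂ | a) estimated by counting over the training set; the Laplace smoothing term is an implementation assumption, not part of the original method:

    import numpy as np

    def estimate_joint_conditionals(a_hat, v_hat, a_true, NA, NC, alpha=1.0):
        """P(a_hat, v_hat | a): counts over the training set, Laplace-smoothed."""
        counts = np.full((NA, NA, NC), alpha)            # [true class, recognized class, view]
        for ah, vh, at in zip(a_hat, v_hat, a_true):
            counts[at, ah, vh] += 1.0
        return counts / counts.sum(axis=(1, 2), keepdims=True)

    def combine_bayesian(decisions, cond, priors):
        """decisions: list of (a_hat_i, v_hat_i) pairs, one per available camera.

        cond: P(a_hat, v_hat | a) as returned above; priors: P(a), e.g. uniform.
        Returns argmax_j sum_i P(a_j | a_hat_i, v_hat_i), i.e. the sum-rule decision.
        """
        NA = len(priors)
        score = np.zeros(NA)
        for ah, vh in decisions:
            joint = cond[:, ah, vh] * priors             # P(a_hat_i, v_hat_i | a_j) P(a_j)
            score += joint / joint.sum()                 # posterior of Eq. (12)
        return int(np.argmax(score))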

In (12), the denominator Σ_{l=1}^{N_A} P(â_i, v̂_i | a_l) P(a_l) = P(â_i, v̂_i) refers to the probability that action vectors corresponding to action videos captured from the recognized viewing angle v̂_i belong to the recognized action class â_i. This probability is indicative of the ability of each viewing angle to correctly recognize actions. Some views may offer more information to correctly classify some of the action vectors to an action class. In the case where v̂_i is capable of distinguishing â_i from all the other action classes, P(â_i, v̂_i) will be equal to its actual value, thus having a small impact on the final decision P(a_j | â_i, v̂_i). In the case where v̂_i confuses some action classes, P(â_i, v̂_i) will have a value either higher or smaller than its actual one and will influence the decision P(a_j | â_i, v̂_i). For example, consider the case of recognizing action class bend from the front viewing angle. Because this action is well distinguished from all the other actions when it is captured from the front viewing angle, P(â_i, v̂_i) will not influence the final decision. The case of recognizing action classes jump in place and sit from a 270° side viewing angle is different. Action videos belonging to these action classes captured from this viewing angle are confused. Specifically, all the action videos recognized to belong to action class jump in place actually belong to this class, while some action videos recognized to belong to action class sit actually belong to action class jump in place. In this case, P(â_i, v̂_i) will have a high value for the action class sit, thus providing a low value of P(a_j | â_i, v̂_i), while P(â_i, v̂_i) will have a low value for the action class jump in place, thus providing a high value of P(a_j | â_i, v̂_i). That is, the recognition of action class sit from the 270° side viewing angle is ambiguous, as it is probable that the action video belongs to action class jump in place, while the recognition of action class jump in place from the same viewing angle is of high confidence, as action videos recognized to belong to action class jump in place actually belong to this class.

The term P(â_i, v̂_i | a_j) in (12) refers to the probability that the i-th action vector is detected as an action video captured from viewing angle v̂_i and classified to action class â_i, given that the actual action class of this action vector is a_j. This probability is indicative of the similarity between actions when they are captured from different viewing angles and provides an estimate of action discrimination for each viewing angle. Figure 6 illustrates the probabilities P(â_i, v̂_i | a_j), j = 1, ..., N_A, i = 1, ..., N_C, â_i = a_j, i.e., the probabilities to correctly classify an action video belonging to action class a_j from each of the viewing angles v̂_i, for an action class set A = {walk, run, jump in place, jump forward, bend, sit, fall, wave one hand}, produced using a 12 × 12 SOM and an 8-camera setup in which each camera captures the person from one of the eight viewing angles V = {0°, 45°, 90°, 135°, 180°, 225°, 270°, 315°}. In this Figure, it can be seen that action vectors belonging to action classes jump in place, bend and fall are almost always correctly classified from every viewing angle. On the other hand, action vectors belonging to the remaining actions are more difficult to correctly classify for some viewing angles. As was expected, the side views are the most capable in terms of classification for action classes walk, run, jump forward and sit, while in the case of action class wave one hand the best views are the frontal and the back ones.

    Fig. 6. Single-view action classification results presented as input to theBayesian framework for eight actions captured from eight viewing angles.

    F. Camera Viewing Angle Identification

The proposed method utilizes a multi-camera setup. The person that performs an action can move freely, and this affects the viewing angle from which he/she is captured by each camera. Exploiting the SOM posture prototype representation in Figure 3 and the observation that posture prototypes corresponding to each viewing angle lie in different lattice locations, as presented in Figure 5, a second MLP is proposed to identify the viewing angle v̂_i from which the person is captured by each camera. Similarly to the MLP used for action classification, it consists of N_S input nodes (equal to the dimensionality of the action vectors s), N_C outputs (each corresponding to a viewing angle) and uses the hyperbolic tangent function as activation function. Its training procedure is similar to the one presented in Subsection II-D. However, this time, the training outputs are set to o_{ik} = 0.95, k = 1, ..., N_C, for action vectors belonging to action videos captured from the k-th viewing angle and o_{ik} = -0.95 otherwise.

In the test phase, each of the N action vectors s_i constituting an action video set S that corresponds to the same action captured from different viewing angles is introduced to the MLP, and the viewing angle v̂_i corresponding to each action vector is recognized based on the maximum MLP response:

v̂_i = argmax_j ô_{ij}, i = 1, ..., N, j = 1, ..., N_C.   (13)

    G. Action Recognition (test phase)

Let a person perform an action captured by N ≤ N_C cameras. In the case of elementary action recognition, this action is captured in N action videos, while in the case of continuous action recognition, a sliding window consisting of N_{tw} video frames is used to create the N action videos used to perform action recognition at every window location. These videos are preprocessed as discussed in Section II-A to produce N × N_t posture vectors p_{ij}, i = 1, ..., N, j = 1, ..., N_t, where N_t = N_{tj} or N_t = N_{tw} in the elementary and the continuous action recognition tasks, respectively. Fuzzy distances d_{ijk} from all the test posture vectors p_{ij} to every SOM posture prototype w_{Sk}, k = 1, ..., N_S, are calculated and a set S of N test action vectors s_i is obtained. These test action vectors are fed to the action recognition MLP and N action recognition results â_i are obtained. In the case of majority voting, the action vector set S is classified to the action class a_j that has the most votes. In the Bayesian framework case, the N action vectors s_i are fed to the viewing angle identification MLP to recognize the corresponding viewing angles v̂_i, and the action vector set S is classified to the action class a_j that provides the highest cumulative probability according to the Bayesian decision. Figure 7 illustrates the procedure followed in the recognition phase for the majority voting and the Bayesian framework cases.

Fig. 7. Action recognition system overview (test phase).

    III. EXPERIMENTS

The experiments conducted in order to evaluate the performance of the proposed method are presented in this section. To demonstrate the ability of the proposed method to correctly classify actions performed by different persons, as variations in action execution speed and style may be observed, the leave-one-person-out cross-validation procedure was applied to the i3DPost multi-view action recognition database [32]; these experiments are discussed in Subsection III-C. Subsection III-D discusses the operation of the proposed method in the case of multi-period action videos. Subsection III-E presents its robustness in the case of action recognition at different frame rates between the training and test phases. In Subsection III-F, a comparative study that deals with the ability of every viewing angle to correctly classify actions is presented. The case of different camera setups in the training and recognition phases is discussed in Subsection III-G, while the case of action recognition using an arbitrary number of cameras in the test phase is presented in Subsection III-H. The ability of the proposed approach to perform action recognition in the case of human interactions is discussed in Subsection III-I.

    A. The i3DPost multi-view database

The i3DPost multi-view database [32] contains 80 high resolution image sequences depicting eight persons performing eight actions and two person interactions. Eight cameras, having a wide 45° viewing angle difference to provide 360° coverage of the capture volume, were placed on a ring of 8 m diameter at a height of 2 m above the studio floor. The studio had a blue background. The actions performed in 64 video sequences are: walk (wk), run (rn), jump in place (jp), jump forward (jf), bend (bd), fall (fl), sit on a chair (st) and wave one hand (wo). The remaining 16 sequences depict two persons that interact. These interactions are: shake hand (sh) and pull down (pl).

    B. The IXMAS multi-view database

The INRIA (Institut National de Recherche en Informatique et en Automatique) Xmas Motion Acquisition Sequences (IXMAS) database [22] contains 330 low resolution (291 × 390 pixels) image sequences depicting 10 persons performing 11 actions. Each sequence has been captured by five cameras. The persons freely change position and orientation. The actions performed are: check watch, cross arms, scratch head, sit down, get up, turn around, walk in a circle, wave hand, punch, kick, and pick up. Binary images denoting the person's body are provided by the database.

    C. Cross-validation in i3DPost multi-view database

The cross-validation procedure described in Subsection II-B was applied to the i3DPost eight-view database, using the action video sequences of the eight persons. Action videos were manually extracted and binary action videos were obtained by thresholding the blue color in the HSV color space. Figure 8a illustrates the recognition rates obtained for various SOM lattice topologies for the majority voting and the Bayesian framework cases. It can be seen that high recognition rates were observed. The optimal topology was found to be a 12 × 12 lattice. A recognition rate equal to 93.9% was obtained for the majority voting case. The Bayesian approach outperforms the majority voting one, providing a recognition rate equal to 94.04% for the view-independent approach. As can be seen, the use of viewing angle information results in an increase of the recognition ability. The best recognition rate was found to be equal to 94.87%, for the Bayesian approach incorporating the viewing angle recognition results. The confusion matrix corresponding to the best recognition result is presented in Table I. In this matrix, rows represent the actual action classes and columns the recognition results. As can be seen, actions which contain discriminant body postures, such as bend, fall and wave one hand, are perfectly classified. Actions having a large number of similar body postures, such as walk-run or jump in place-jump forward-sit, are more difficult to correctly classify. However, even for these cases, the classification accuracy is very high.
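For illustration, the blue-background thresholding can be sketched as below; the exact HSV thresholds depend on the studio lighting and are assumptions here, not values from the paper:

    import cv2
    import numpy as np

    # Hypothetical blue range in OpenCV's HSV space (H in [0, 180]).
    LOWER_BLUE = np.array([100, 80, 40])
    UPPER_BLUE = np.array([130, 255, 255])

    def person_mask(frame_bgr):
        """Foreground mask: everything that is not blue background is person."""
        hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
        background = cv2.inRange(hsv, LOWER_BLUE, UPPER_BLUE)
        return (background == 0).astype(np.uint8)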


    Fig. 8. a) Action recognition rates vs various lattice dimensions of the SOM.b) Recognition results in the case of continuous action recognition.

    D. Continuous action recognition

This section presents the functionality of the proposed method in the case of continuous (multi-period) action recognition. Eight multi-period videos, each corresponding to one viewing angle, depicting one of the persons of the i3DPost eight-view action database, were manually created by concatenating single-period action videos. The algorithm was trained using the action videos depicting the remaining seven persons, using a 12 × 12 lattice and combining the classification results corresponding to each camera with the Bayesian framework. In the test phase, a sliding window of N_{tw} = 21 video frames was used and recognition was performed at every sliding window position. A majority vote filter, of size equal to 11 video frames, was applied to every classification result.

Figure 8b illustrates the results of this experiment. In this Figure, the ground truth is illustrated by a continuous line and the recognition results by a dashed one. In the first 20 frames no action recognition result was produced, as the algorithm needs 21 frames (equal to the length of the sliding window N_{tw}) to perform action recognition. Moreover, a delay is observed in the classification results, as the algorithm uses observations that refer to past video frames (t, t-1, ..., t-N_t+1). This delay was found to be between 12 and 21 video frames. Only one recognition error occurred, at the transition between actions sit and fall.
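The sliding-window evaluation with the majority-vote filter over the stream of window-level decisions can be sketched as follows (window and filter lengths follow the experiment; the helper names are illustrative):

    from collections import Counter

    def continuous_recognition(frames_per_camera, recognize_window, win=21, filt=11):
        """frames_per_camera: list of per-camera frame sequences of equal length T.

        recognize_window: callable taking the N window segments and returning a label
        (e.g. the Bayesian combination of the per-camera MLP decisions).
        """
        T = len(frames_per_camera[0])
        raw = []
        for t in range(win - 1, T):                      # first decision after `win` frames
            segments = [cam[t - win + 1:t + 1] for cam in frames_per_camera]
            raw.append(recognize_window(segments))
        # majority-vote filter of length `filt` applied over the decision stream
        smoothed = [Counter(raw[max(0, k - filt + 1):k + 1]).most_common(1)[0][0]
                    for k in range(len(raw))]
        return smoothed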

TABLE I
CONFUSION MATRIX FOR EIGHT ACTIONS.

        wk    rn    jp    jf    bd    st    fl    wo
  wk   0.95  0.05
  rn   0.05  0.95
  jp               0.92  0.02        0.06
  jf               0.05  0.90        0.05
  bd                           1.00
  st               0.13              0.87
  fl                                       1.00
  wo                                             1.00

E. Action recognition at different video frame rates

To simulate the situation of recognizing actions using cameras of different frame rates between the training and test phases, an experiment was set up as follows. The cross-validation procedure using a 12 × 12 lattice and the Bayesian framework was applied for different camera frame rates in the test phase. That is, in the training phase the action videos depicting the training persons were used to train the algorithm using their actual number of frames. In the test phase, the number of frames constituting the action videos was reduced, in order to achieve recognition at a lower frame rate. That is, for action recognition at half the frame rate, the test action videos consisted of the even-numbered frames, i.e., n_t mod 2 = 0, where n_t is the frame number of each video frame and mod refers to the modulo operator. In the case of a test frame rate equal to 1/3 of the training frame rate, only frames with frame number n_t mod 3 = 0 were used, etc. In the general case, where the test to training video frame rate ratio was equal to 1/K, the test action videos consisted of the video frames satisfying n_t mod K = 0. Figure 9a shows the results for various values of K. It can be seen that the frame rate variation between the training and test phases does not influence the performance of the proposed method. In fact, it was observed that, for certain actions, a single posture frame that depicts a well distinguished posture of the action is enough to produce a correct classification result. This verifies the observation made in Subsection II-D that human body postures of different actions are placed at different positions on the lattice. Thus, the corresponding neurons are responsible for recognizing the correct action class. To verify this, the algorithm was tested using single body posture masks that depict a person from various viewing angles. Results of this experiment are illustrated in Figure 10. In this Figure, it can be seen that the MLP can correctly classify action vectors that correspond to single human body postures. Even for difficult cases, such as walk at 0° or run at 315°, the MLP can correctly recognize the action at hand.
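The frame-rate simulation amounts to keeping only the frames whose index satisfies n_t mod K = 0 before computing the action vectors, e.g.:

    def subsample_frames(frames, K):
        """Keep frames whose index satisfies n_t mod K == 0 (test at 1/K of the training rate)."""
        return [f for n_t, f in enumerate(frames) if n_t % K == 0]

    # e.g. half the training frame rate:
    # test_video_half_rate = subsample_frames(test_video, 2)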


    Fig. 9. a) Recognition results for different video frame rates between trainingand test phase. b) Action recognition rates vs various occlusion levels.

    Fig. 10. MLP responses for single human body posture images as input.

F. Actions versus viewing angle

A comparative study that specifies the action discrimination from different viewing angles is presented in this subsection. Using a 12 × 12 lattice and the Bayesian framework, the cross-validation procedure was applied to the action videos depicting the actions in each of the eight viewing angles {0°, 45°, 90°, 135°, 180°, 225°, 270°, 315°}. That is, eight single-view elementary action recognition procedures were performed. Figure 11 presents the recognition rates achieved for each of the actions. In this Figure, the probability to correctly recognize an incoming action from every viewing angle is presented, e.g., the probability to correctly recognize a walking sequence captured from the frontal view is equal to 77.7%. As was expected, for most action classes, the side views are the most discriminant ones and result in the best recognition rates. In the case of action wave one hand, the frontal and the back views are the most discriminant ones and result in the best recognition rates. Finally, well distinguished actions, such as bend and fall, are well recognized from any viewing angle. This can be explained by the fact that the body postures that describe them are quite distinctive at any viewing angle.

Table II presents the overall action recognition accuracy achieved in every single-view action recognition experiment. In this Table, the probability to correctly recognize an incoming action from a given viewing angle is presented. For example, the probability to correctly recognize one of the eight actions captured from the frontal view is equal to 77.5%. As can be seen, side views result in better recognition rates, because most of the actions are well discriminated when observed from side views. The best recognition accuracy is equal to 86.1% and comes from a side view (135°). It should be noted that the use of the Bayesian framework improves the recognition accuracy, as the combination of these recognition outputs leads to a recognition rate equal to 94.87%, as discussed in Subsection III-C.

    Fig. 11. Recognition rates of different actions when observed from differentviewing angles.

TABLE II
RECOGNITION RATES OBTAINED FOR EACH VIEWING ANGLE.

      0°: 77.5%     45°: 80.9%     90°: 82.4%    135°: 86.1%
    180°: 74.1%    225°: 80.5%    270°: 79.2%    315°: 80.2%

    G. Action recognition using reduced camera setups

In this subsection, a comparative study between different reduced camera setups is presented. Using a 12×12 lattice, the cross-validation procedure was applied for different reduced test camera setups. In the training phase, all action videos depicting the training persons from all eight cameras were used to train the proposed algorithm. In the test phase, only the action videos depicting the test person from the cameras specified by the reduced camera setup were used. Because the movement direction of the eight persons varies, the cameras in these experiments do not correspond to a specific viewing angle, i.e., camera #1 may or may not depict the person's front view. Figure 12 presents the recognition rates achieved by applying this procedure for eight different camera setups. As can be seen, a recognition rate equal to 92.3% was achieved using only four cameras having a 90° viewing angle difference to provide 360° coverage of the capture volume. It can also be seen that, even for two cameras placed at arbitrary viewing angles, a recognition rate greater than 83% is achieved.
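A minimal sketch of this reduced-setup evaluation follows, assuming the per-camera classifier outputs for the test videos are stored as an array of class-probability vectors and fused with a simple product rule; the array layout and camera indices are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

def reduced_setup_accuracy(posteriors, labels, camera_subset):
    """Recognition accuracy when only the cameras in camera_subset are used at
    test time (training is still assumed to use all eight cameras).

    posteriors: (n_videos, n_cameras, n_classes) per-camera classifier outputs
    for the test videos; labels: ground-truth class index per test video.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    labels = np.asarray(labels)
    scores = np.sum(np.log(posteriors[:, camera_subset] + 1e-12), axis=1)
    return float(np.mean(np.argmax(scores, axis=1) == labels))

# e.g., a four-camera setup with roughly 90-degree spacing (indices illustrative):
# acc = reduced_setup_accuracy(P_test, y_test, camera_subset=[0, 2, 4, 6])
```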

    H. Recognition in occlusion

To simulate the situation of recognizing actions using an arbitrary number of cameras, an experiment was set up as follows. The cross-validation procedure using a 12×12 lattice and the Bayesian framework was applied for a varying number of cameras in the test phase. That is, in the training phase the action videos depicting the training persons from all eight cameras were used to train the algorithm. In the test phase, the number and the capturing views of the test person's action videos were randomly chosen. This experiment was repeated for a varying number of cameras depicting the test person. The recognition rates achieved in these experiments can be seen in Figure 9b. Intuitively, we would expect the action recognition accuracy to be low when using a small number of cameras in the test phase, due to the viewing angle effect. By using a large number of cameras in the test phase, the viewing angle effect should be addressed properly, resulting in an increased action recognition accuracy. Using one arbitrarily chosen camera, a recognition rate equal to 79% was obtained, while using four arbitrarily chosen cameras, the recognition rate increased to 90%. Recognition rates equal to 94.85% and 94.87% were observed using five and eight cameras, respectively. As expected, the use of multiple cameras resulted in an increase of the action recognition accuracy. This experiment illustrates the ability of the proposed approach to recognize actions with high accuracy when an arbitrary number of cameras depicts the person from arbitrary viewing angles.
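The random-camera experiment can be approximated by the Monte Carlo sketch below, again under the assumption that per-camera class-probability outputs are available and are fused with a product rule; the number of trials and the fusion rule are illustrative choices rather than the paper's exact procedure.

```python
import numpy as np

def accuracy_with_k_random_cameras(posteriors, labels, k, n_trials=200, seed=0):
    """Mean recognition accuracy when each test video is observed by k randomly
    chosen cameras, fused with a product rule (illustrative protocol).

    posteriors: (n_videos, n_cameras, n_classes) per-camera classifier outputs.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    n_videos, n_cameras, _ = posteriors.shape
    accs = []
    for _ in range(n_trials):
        correct = 0
        for v in range(n_videos):
            cams = rng.choice(n_cameras, size=k, replace=False)
            fused = np.sum(np.log(posteriors[v, cams] + 1e-12), axis=0)
            correct += int(np.argmax(fused) == labels[v])
        accs.append(correct / n_videos)
    return float(np.mean(accs))
```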

    Fig. 12. Recognition rates for the eight different camera setups.

    I. Recognition of human interactions

As previously described, the action recognition task refers to the classification of actions performed by one person. To demonstrate the ability of the proposed approach to correctly classify actions performed by more than one person, i.e., human interactions, the cross-validation procedure was applied to the i3DPost eight-view database including the action videos that depict human interactions, e.g., shake hands and pull down. A recognition rate equal to 94.5% was observed for a lattice of 13×13 neurons and the Bayesian framework. An example of a 13×13 lattice is shown in Figure 13. The confusion matrix of this experiment can be seen in Table III.


Fig. 13. A 13×13 SOM produced by posture frames of actions and interactions.

To illustrate the continuous action recognition functionality of a system that can recognize interactions, an experiment was set up as follows: the algorithm was trained using the action videos depicting seven persons of the i3DPost action dataset, including the two interactions (shake hands and pull down), using a 13×13 lattice topology and the Bayesian framework. The original action videos depicting the side views of the eighth person performing these interactions were tested using a sliding window of NtW = 21 video frames. Figure 14 illustrates qualitative results of this procedure. When the two persons were separated, each person was tracked at subsequent frames using a closest-area blob tracking algorithm and the binary images depicting each person were fed to the algorithm for action/interaction recognition. When the two persons interacted, the whole binary image was introduced to the algorithm for recognition. In order to use the remaining cameras, more sophisticated blob tracking methods could be used [33], [34]. As can be seen, the proposed method can be extended to recognize interactions in a continuous recognition setup.
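A schematic view of the sliding-window procedure is given below. It assumes only that a trained classifier maps a window of NtW consecutive posture frames to an action/interaction label; the callable classify_window is a placeholder for the SOM/fuzzy-distance/MLP pipeline, not the paper's implementation.

```python
def continuous_recognition(frame_features, classify_window, window_len=21, step=1):
    """Slide a fixed-length window over a posture-frame sequence and classify
    each window, returning (start_frame, predicted_label) pairs.

    frame_features: per-frame representation of one video (e.g., binary posture
    images); classify_window: any callable mapping a window of frames to a
    label, standing in for the trained action/interaction classifier.
    """
    results = []
    for start in range(0, len(frame_features) - window_len + 1, step):
        window = frame_features[start:start + window_len]
        results.append((start, classify_window(window)))
    return results
```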

TABLE III
CONFUSION MATRIX FOR EIGHT ACTIONS AND TWO INTERACTIONS
(class abbreviations: wk, rn, jp, jf, bd, hs, pl, st, fl, wo; each row lists the classification rates assigned to that class)

wk   0.95  0.05
rn   0.05  0.95
jp   0.81  0.10  0.02  0.07
jf   0.05  0.95
bd   1
hs   1
pl   1
st   0.13  0.87
fl   1
wo   0.03  0.05  0.92

    Fig. 14. Continuous recognition of human interactions.

    J. Comparison against other methods

In this section, we compare the proposed method with state-of-the-art methods recently proposed in the literature, aiming at view-independent action recognition using multi-camera setups. Table IV illustrates comparison results with three methods, evaluating their performance on the i3DPost multi-view action recognition database using all the cameras constituting the database camera setup. In [35], the authors performed the LOOCV procedure on an action class set consisting of the actions walk, run, jump in place, jump forward and bend. The authors in [36] included the action wave one hand in their experimental setup and performed the LOOCV procedure using six actions, and also removed the action run in order to perform the LOOCV procedure using five actions. Finally, the authors in [37] applied the LOOCV procedure using all eight actions appearing in the database. As can be seen in Table IV, the proposed method outperforms all the aforementioned methods.

TABLE IV
COMPARISON RESULTS IN THE I3DPOST MULTI-VIEW ACTION RECOGNITION DATABASE.

                   5 actions   8 actions   5 actions   6 actions
Method [35]        90%         -           -           -
Method [37]        -           90.88%      -           -
Method [36]        -           -           97.5%       89.58%
Proposed method    94.4%       94.87%      97.8%       95.33%

In order to compare our method with other methods using the IXMAS action recognition dataset, we performed the LOOCV procedure using the binary images provided in the database. In an off-line procedure, each image sequence was split into smaller segments, in order to produce action videos. Subsequently, the LOOCV procedure was performed using different SOM topologies and the Bayesian framework approach. By using a 13×13 SOM, an action recognition rate equal to 89.8% was obtained. Table V illustrates comparison results with three methods evaluating their performance on the IXMAS multi-view action recognition database. As can be seen, the proposed method outperforms these methods, providing an 8.5% improvement in action classification accuracy over the best-performing of them.
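For reference, the LOOCV protocol used throughout these comparisons can be summarized by the generic sketch below, which trains on all but one person and tests on the held-out person; train_fn and predict_fn are placeholders for the proposed SOM/MLP pipeline and are not the authors' code.

```python
import numpy as np

def leave_one_person_out(videos, labels, person_ids, train_fn, predict_fn):
    """Leave-one-person-out cross-validation: for each person, train on the
    remaining persons' action videos, test on the held-out person's videos,
    and average the resulting accuracies."""
    person_ids = np.asarray(person_ids)
    accuracies = []
    for p in np.unique(person_ids):
        train_idx = np.where(person_ids != p)[0]
        test_idx = np.where(person_ids == p)[0]
        model = train_fn([videos[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        hits = [predict_fn(model, videos[i]) == labels[i] for i in test_idx]
        accuracies.append(np.mean(hits))
    return float(np.mean(accuracies))
```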

TABLE V
COMPARISON RESULTS IN THE IXMAS MULTI-VIEW ACTION RECOGNITION DATABASE.

Method [38]   Method [39]   Method [40]   Proposed method
81.3%         81%           80.6%         89.8%

IV. CONCLUSION

A powerful framework based on Artificial Neural Networks has been proposed for action recognition. The proposed method highlights the strength of ANNs in representing and classifying visual information. A SOM-based human body posture representation is combined with Multilayer Perceptrons. Action and viewing angle classification is achieved independently for all cameras. A Bayesian framework is exploited in order to provide the optimal combination of the action classification results coming from all available cameras. The effectiveness of the proposed method in challenging problem setups has been demonstrated by experimentation. To the authors' knowledge, there is no other method in the literature that can deal with all the presented challenges in action recognition. Furthermore, it has been shown that the same framework can be applied to the recognition of human interactions between persons, without any modification.

    ACKNOWLEDGMENT

The research leading to these results has received funding from the Collaborative European Project MOBISERV FP7-248434 (http://www.mobiserv.eu), An Integrated Intelligent Home Environment for the Provision of Health, Nutrition and Mobility Services to the Elderly.

    REFERENCES

[1] L. Weilun, H. Jungong, and P. With, Flexible human behavior analysis framework for video surveillance applications, International Journal of Digital Multimedia Broadcasting, vol. 2010, Article ID 920121, 9 pages, 2010.

[2] P. Barr, J. Noble, and R. Biddle, Video game values: Human-computer interaction and games, Interacting with Computers, vol. 19, no. 2, pp. 180–195, Mar. 2007.

[3] B. Song, E. Tuncel, and A. Chowdhury, Towards a multi-terminal video compression algorithm by integrating distributed source coding with geometrical constraints, Journal of Multimedia, vol. 2, no. 3, pp. 9–16, 2007.

[4] T. Höllerer, S. Feiner, D. Hallaway, B. Bell, M. Lanzagorta, D. Brown, S. Julier, Y. Baillot, and L. Rosenblum, User interface management techniques for collaborative mobile augmented reality, Computers and Graphics, vol. 25, no. 5, pp. 799–810, Oct. 2001.

[5] S. Ali and M. Shah, Human action recognition in videos using kinematic features and multiple instance learning, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 2, pp. 288–303, Feb. 2010.

[6] J. Hoey and J. Little, Representation and recognition of complex human motion, in Proceedings of the IEEE Conference on Computer Vision, vol. 1, 2000, pp. 752–759.

[7] Y. Wang and G. Mori, Hidden part models for human action recognition: Probabilistic versus max margin, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 7, pp. 1310–1323, July 2011.

[8] H. J. Seo and P. Milanfar, Action recognition from one example, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 5, pp. 867–882, May 2011.

[9] M. Giese and T. Poggio, Neural mechanisms for the recognition of biological movements, Nature Reviews Neuroscience, vol. 4, no. 3, pp. 179–192, Mar. 2003.

[10] N. Gkalelis, A. Tefas, and I. Pitas, Combining fuzzy vector quantization with linear discriminant analysis for continuous human movement recognition, IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 11, pp. 1511–1521, Nov. 2008.

[11] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri, Actions as space-time shapes, IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 2247–2253, 2007.

[12] A. Iosifidis, A. Tefas, and I. Pitas, Activity based person identification using fuzzy representation and discriminant learning, IEEE Transactions on Information Forensics and Security, no. 99.

[13] M. Ursino, E. Magosso, and C. Cuppini, Recognition of abstract objects via neural oscillators: interaction among topological organization, associative memory and gamma band synchronization, IEEE Transactions on Neural Networks, vol. 20, no. 2, pp. 316–335, 2009.

[14] M. Schmitt, On the sample complexity of learning for networks of spiking neurons with nonlinear synaptic interactions, IEEE Transactions on Neural Networks, vol. 15, no. 5, pp. 995–1001, 2004.

[15] B. Ruf and M. Schmitt, Self-organization of spiking neurons using action potential timing, IEEE Transactions on Neural Networks, vol. 9, no. 3, pp. 575–578, 1998.

[16] C. Schuldt, I. Laptev, and B. Caputo, Recognizing human actions: A local SVM approach, in Proceedings of the International Conference on Pattern Recognition, vol. 3, pp. 32–36.

[17] H. Jhuang, T. Serre, L. Wolf, and T. Poggio, A biologically inspired system for action recognition, in Proceedings of the IEEE International Conference on Computer Vision, vol. 1, 2007.

[18] A. Iosifidis, A. Tefas, N. Nikolaidis, and I. Pitas, Multi-view human movement recognition based on fuzzy distances and linear discriminant analysis, Computer Vision and Image Understanding, 2011.

[19] S. Yu, D. Tan, and T. Tan, Modeling the effect of view angle variation on appearance-based gait recognition, in Proceedings of the Asian Conference on Computer Vision, vol. 1, Jan. 2006, pp. 807–816.

[20] O. Masoud and N. Papanikolopoulos, A method for human action recognition, Image and Vision Computing, vol. 21, no. 8, pp. 729–743, Aug. 2003.

[21] M. Ahmad and S. Lee, Human action recognition using shape and CLG-motion flow from multi-view image sequences, Pattern Recognition, vol. 41, no. 7, pp. 2237–2252, July 2008.

[22] D. Weinland, R. Ronfard, and E. Boyer, Free viewpoint action recognition using motion history volumes, Computer Vision and Image Understanding, vol. 104, no. 2–3, pp. 249–257, Nov./Dec. 2006.

[23] M. Ahmad and S. Lee, HMM-based human action recognition using multiview image sequences, in Proceedings of the IEEE International Conference on Pattern Recognition, vol. 1, 2006, pp. 263–266.

[24] F. Qureshi and D. Terzopoulos, Surveillance camera scheduling: A virtual vision approach, in Proceedings of the Third ACM International Workshop on Video Surveillance and Sensor Networks, vol. 12, Nov. 2005, pp. 269–283.

[25] Y. Benezeth, P. Jodoin, B. Emile, H. Laurent, and C. Rosenberger, Review and evaluation of commonly-implemented background subtraction algorithms, in 19th International Conference on Pattern Recognition (ICPR 2008), IEEE, 2008, pp. 1–4.

[26] M. Piccardi, Background subtraction techniques: a review, in IEEE International Conference on Systems, Man and Cybernetics, vol. 4, IEEE, 2005, pp. 3099–3104.

[27] T. Kohonen, The self-organizing map, Proceedings of the IEEE, vol. 78, no. 9, pp. 1464–1480, 1990.

[28] S. Haykin, Neural Networks and Learning Machines, Upper Saddle River, New Jersey, 2008.

[29] Y. Le Cun, Efficient learning and second order methods, tutorial presented at Neural Information Processing Systems, vol. 5, 1993.

[30] P. Werbos, Beyond regression: New tools for prediction and analysis in the behavioral sciences, 1974.

[31] J. Kittler, M. Hatef, R. Duin, and J. Matas, On combining classifiers, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226–239, 1998.

[32] N. Gkalelis, H. Kim, A. Hilton, N. Nikolaidis, and I. Pitas, The i3DPost multi-view and 3D human action/interaction database, in 6th Conference on Visual Media Production, Nov. 2009, pp. 159–168.

[33] N. Papadakis and A. Bugeau, Tracking with occlusions via graph cuts, IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 144–157, 2010.

[34] O. Lanz, Approximate Bayesian multibody tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1436–1449, 2006.

[35] N. Gkalelis, N. Nikolaidis, and I. Pitas, View independent human movement recognition from multi-view video exploiting a circular invariant posture representation, in IEEE International Conference on Multimedia and Expo, 2009, pp. 394–397.

[36] M. B. Holte, T. B. Moeslund, N. Nikolaidis, and I. Pitas, 3D human action recognition for multi-view camera systems, in First Joint 3D Imaging Modeling Processing Visualization Transmission (3DIM/3DPVT) Conference, IEEE, 2011.

[37] A. Iosifidis, N. Nikolaidis, and I. Pitas, Movement recognition exploiting multi-view information, in IEEE International Workshop on Multimedia Signal Processing, 2010, pp. 427–431.

[38] D. Weinland, E. Boyer, and R. Ronfard, Action recognition from arbitrary views using 3D exemplars, in Proceedings of the International Conference on Computer Vision, IEEE, 2007, pp. 1–7.

[39] D. Tran and A. Sorokin, Human activity recognition with metric learning, in Computer Vision – ECCV 2008, 2008, pp. 548–561.

[40] F. Lv and R. Nevatia, Single view human action recognition using key pose matching and Viterbi path searching, in IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.