
3D Convolutional Neural Networks for Human Action Recognition

    Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu

Abstract: We consider the automated recognition of human actions in surveillance videos. Most current methods build classifiers based on complex handcrafted features computed from the raw inputs. Convolutional neural networks (CNNs) are a type of deep model that can act directly on the raw inputs. However, such models are currently limited to handling 2D inputs. In this paper, we develop a novel 3D CNN model for action recognition. This model extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames. The developed model generates multiple channels of information from the input frames, and the final feature representation combines information from all channels. To further boost the performance of 3D CNN models, we propose to regularize the models with high-level features and to combine the outputs of a variety of different models. We apply the developed models to recognize human actions in the real-world environment of airport surveillance videos, and they achieve superior performance in comparison to baseline methods.

Index Terms: deep learning, convolutional neural networks, 3D convolution, model combination, action recognition

    1 INTRODUCTION

Recognizing human actions in the real-world environment finds applications in a variety of domains including intelligent video surveillance, customer attributes, and shopping behavior analysis. However, accurate recognition of actions is a highly challenging task due to cluttered backgrounds, occlusions, and viewpoint variations, etc. [1]-[11]. Most of the current approaches [12]-[16] make certain assumptions (e.g., small scale and viewpoint changes) about the circumstances under which the video was taken. However, such assumptions seldom hold in the real-world environment. In addition, most of the methods follow a two-step approach in which the first step computes features from raw video frames and the second step learns classifiers based on the obtained features.

In real-world scenarios, it is rarely known what features are important for the task at hand, since the choice of features is highly problem-dependent.

S. Ji is with the Department of Computer Science, Old Dominion University, Norfolk, VA 23529. E-mail: [email protected]

W. Xu is with Facebook, Inc., Palo Alto, CA 94304. E-mail: [email protected]

M. Yang and K. Yu are with NEC Laboratories America, Inc., Cupertino, CA 95014. E-mails: {myang,kyu}@sv.nec-labs.com

Especially for human action recognition, different action classes may appear dramatically different in terms of their appearances and motion patterns.

Deep learning models [17]-[21] are a class of machines that can learn a hierarchy of features by building high-level features from low-level ones. Such learning machines can be trained using either supervised or unsupervised approaches, and the resulting systems have been shown to yield competitive performance in visual object recognition [17], [19], [22]-[24], human action recognition [25]-[27], natural language processing [28], audio classification [29], brain-computer interaction [30], human tracking [31], image restoration [32], denoising [33], and segmentation tasks [34]. Convolutional neural networks (CNNs) [17] are a type of deep model in which trainable filters and local neighborhood pooling operations are applied alternatingly on the raw input images, resulting in a hierarchy of increasingly complex features. It has been shown that, when trained with appropriate regularization [35]-[37], CNNs can achieve superior performance on visual object recognition tasks. In addition, CNNs have been shown to be invariant to certain variations such as pose, lighting, and surrounding clutter [38].

As a class of deep models for feature construction, CNNs have been primarily applied on 2D images. In this paper, we explore the use of CNNs


Fig. 1. Comparison of 2D (a) and 3D (b) convolutions. In (b) the size of the convolution kernel in the temporal dimension is 3, and the sets of connections are color-coded so that the shared weights are in the same color. In 3D convolution, the same 3D kernel is applied to overlapping 3D cubes in the input video to extract motion features.

where tanh(·) is the hyperbolic tangent function, b_{ij} is the bias for this feature map, m indexes over the set of feature maps in the (i-1)th layer connected to the current feature map, w_{ijk}^{pq} is the value at position (p, q) of the kernel connected to the kth feature map, and P_i and Q_i are the height and width of the kernel, respectively. In the subsampling layers, the resolution of the feature maps is reduced by pooling over a local neighborhood on the feature maps in the previous layer, thereby enhancing invariance to distortions on the inputs. A CNN architecture can be constructed by stacking multiple layers of convolution and subsampling in an alternating fashion. The parameters of a CNN, such as the bias b_{ij} and the kernel weights w_{ijk}^{pq}, are usually learned using either supervised or unsupervised approaches [17], [22].
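As a concrete illustration of the 2D convolution described above, the following sketch computes the value of a single output unit from the bias, the kernels connected to the maps of the previous layer, and the tanh nonlinearity. The function name and the toy sizes are illustrative, not from the paper.

```python
import numpy as np

def conv2d_value(prev_maps, kernels, bias, x, y):
    """Value at position (x, y) of one feature map in the 2D case:
    tanh(b + sum over connected maps m and kernel offsets (p, q) of
    w[m, p, q] * prev_maps[m, x + p, y + q])."""
    acc = bias
    n_maps, P, Q = kernels.shape            # one P x Q kernel per connected map
    for m in range(n_maps):
        for p in range(P):
            for q in range(Q):
                acc += kernels[m, p, q] * prev_maps[m, x + p, y + q]
    return np.tanh(acc)

# Toy usage: 2 input maps of size 8 x 8 and 3 x 3 kernels.
rng = np.random.default_rng(0)
prev = rng.standard_normal((2, 8, 8))
w = rng.standard_normal((2, 3, 3))
print(conv2d_value(prev, w, bias=0.1, x=2, y=3))
```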

    2.1 3D Convolution

In 2D CNNs, convolutions are applied on the 2D feature maps to compute features from the spatial dimensions only. When applied to video analysis problems, it is desirable to capture the motion information encoded in multiple contiguous frames. To this end, we propose to perform 3D convolutions in the convolution stages of CNNs to compute features from both spatial and temporal dimensions. The 3D convolution is achieved by convolving a 3D kernel with the cube formed by stacking multiple contiguous frames together. By this construction, the feature maps in the convolution layer are connected to multiple contiguous frames in the previous layer, thereby capturing motion information. Formally, the value at position (x, y, z) on the jth feature map in the ith layer is given by

v_{ij}^{xyz} = \tanh\Big( b_{ij} + \sum_{m} \sum_{p=0}^{P_i-1} \sum_{q=0}^{Q_i-1} \sum_{r=0}^{R_i-1} w_{ijm}^{pqr} \, v_{(i-1)m}^{(x+p)(y+q)(z+r)} \Big),    (2)

where R_i is the size of the 3D kernel along the temporal dimension, and w_{ijm}^{pqr} is the (p, q, r)th value of the kernel connected to the mth feature map in the previous layer. A comparison of 2D and 3D convolutions is given in Figure 1.
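A minimal NumPy sketch of Eq. (2): the value at (x, y, z) is the tanh of the bias plus a sum over connected maps and kernel offsets. The array layout (maps, height, width, frames) and the helper name are assumptions made for illustration.

```python
import numpy as np

def conv3d_value(prev_maps, kernels, bias, x, y, z):
    """Value at (x, y, z) of one feature map, following Eq. (2):
    tanh(b_ij + sum_m sum_{p,q,r} w[m, p, q, r] * v_{(i-1)m}[x+p, y+q, z+r]).
    prev_maps: (n_maps, H, W, T) feature maps from layer i-1.
    kernels:   (n_maps, P, Q, R) one 3D kernel per connected map."""
    acc = bias
    n_maps, P, Q, R = kernels.shape
    for m in range(n_maps):
        for p in range(P):
            for q in range(Q):
                for r in range(R):
                    acc += kernels[m, p, q, r] * prev_maps[m, x + p, y + q, z + r]
    return np.tanh(acc)

# Applying the same kernel at every (x, y, z) yields one feature map of
# spatial size (H - P + 1, W - Q + 1) and temporal size (T - R + 1).
rng = np.random.default_rng(0)
prev = rng.standard_normal((1, 10, 10, 7))    # one channel, 7 frames
w = rng.standard_normal((1, 3, 3, 3))
out = np.array([[[conv3d_value(prev, w, 0.0, x, y, z)
                  for z in range(5)] for y in range(8)] for x in range(8)])
print(out.shape)  # (8, 8, 5)
```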

Note that a 3D convolutional kernel can only extract one type of features from the frame cube, since the kernel weights are replicated across the entire cube. A general design principle of CNNs is that the number of feature maps should be increased in late layers by generating multiple types of features from the same set of lower-level feature maps. Similar to the case of 2D convolution, this can be achieved by applying multiple 3D convolutions with distinct kernels to the same location in the previous layer (Figure 2).

    2.2 A 3D CNN Architecture

Based on the 3D convolution described above, a variety of CNN architectures can be devised. In the following, we describe a 3D CNN architecture that we have developed for human action recognition on the TRECVID data set. In this architecture, shown in Figure 3, we consider 7 frames of size 60x40 centered on the current frame as inputs to the 3D CNN model. We first apply a set of hardwired kernels to generate multiple channels of information from the input frames. This results in 33 feature maps in the second layer in 5 different channels denoted by gray, gradient-x, gradient-y, optflow-x, and optflow-y. The gray channel contains the gray pixel values of the 7 input frames. The feature maps in the gradient-x and gradient-y channels are obtained by computing gradients


Fig. 2. Extraction of multiple features from contiguous frames. Multiple 3D convolutions can be applied to contiguous frames to extract multiple features. As in Figure 1, the sets of connections are color-coded so that the shared weights are in the same color. Note that all the 6 sets of connections do not share weights, resulting in two different feature maps on the right.

along the horizontal and vertical directions, respectively, on each of the 7 input frames, and the optflow-x and optflow-y channels contain the optical flow fields, along the horizontal and vertical directions, respectively, computed from adjacent input frames. This hardwired layer is employed to encode our prior knowledge on features, and this scheme usually leads to better performance as compared to random initialization.
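The hardwired H1 layer described above can be sketched as follows: the 7 gray frames, their horizontal and vertical gradients, and the two optical-flow channels computed from adjacent frame pairs give 7 + 7 + 7 + 6 + 6 = 33 maps. The paper does not specify the flow estimator, so the flow maps below are zero placeholders.

```python
import numpy as np

def hardwired_channels(frames):
    """Sketch of the hardwired H1 layer, assuming 7 gray frames of size 60 x 40.
    Returns 33 maps: 7 gray + 7 gradient-x + 7 gradient-y + 6 optflow-x + 6 optflow-y."""
    gray = [f.astype(float) for f in frames]
    grad_x = [np.gradient(f, axis=1) for f in gray]   # horizontal gradients
    grad_y = [np.gradient(f, axis=0) for f in gray]   # vertical gradients
    # Placeholders for the two optical-flow channels (one map per adjacent
    # frame pair); the flow algorithm is not specified in the text.
    flow_x = [np.zeros_like(gray[0]) for _ in range(len(gray) - 1)]
    flow_y = [np.zeros_like(gray[0]) for _ in range(len(gray) - 1)]
    return gray + grad_x + grad_y + flow_x + flow_y

frames = [np.zeros((60, 40)) for _ in range(7)]
print(len(hardwired_channels(frames)))  # 33 feature maps
```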

We then apply 3D convolutions with a kernel size of 7x7x3 (7x7 in the spatial dimension and 3 in the temporal dimension) on each of the 5 channels separately. To increase the number of feature maps, two sets of different convolutions are applied at each location, resulting in 2 sets of feature maps in the C2 layer, each consisting of 23 feature maps. In the subsequent subsampling layer S3, we apply 2x2 subsampling on each of the feature maps in the C2 layer, which leads to the same number of feature maps with a reduced spatial resolution. The next convolution layer C4 is obtained by applying 3D convolution with a kernel size of 7x6x3 on each of the 5 channels in the two sets of feature maps separately. To increase the number of feature maps, we apply 3 convolutions with different kernels at each location, leading to 6 distinct sets of feature maps in the C4 layer, each containing 13 feature maps. The next layer S5 is obtained by applying 3x3 subsampling on each of the feature maps in the C4 layer, which leads to the same number of feature maps with a reduced spatial resolution. At this stage, the size of the temporal dimension is already relatively small (3 for gray, gradient-x, and gradient-y, and 2 for optflow-x and optflow-y), so we perform convolution only in the spatial dimension at this layer. The size of the convolution kernel used is 7x4 so that the sizes of the output feature maps are reduced to 1x1. The C6 layer consists of 128 feature maps of size 1x1, and each of them is connected to all the 78 feature maps in the S5 layer.
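The layer sizes quoted above follow from simple bookkeeping: a valid convolution with a kernel of size k shrinks a dimension by k - 1, and subsampling divides it. A short sketch reproducing the numbers in the text:

```python
# Spatial, temporal, and feature-map bookkeeping for the Section 2.2 architecture.
def conv(size, kernel):
    return size - kernel + 1      # "valid" convolution

h, w = 60, 40
h, w = conv(h, 7), conv(w, 7)     # C2: 7x7 spatial kernel -> 54 x 34
h, w = h // 2, w // 2             # S3: 2x2 subsampling    -> 27 x 17
h, w = conv(h, 7), conv(w, 6)     # C4: 7x6 spatial kernel -> 21 x 12
h, w = h // 3, w // 3             # S5: 3x3 subsampling    -> 7 x 4
h, w = conv(h, 7), conv(w, 4)     # C6: 7x4 spatial kernel -> 1 x 1
print(h, w)                       # 1 1

# Temporal sizes: the 7 gray/gradient frames shrink 7 -> 5 -> 3 after two
# 3-frame 3D convolutions; the 6 flow maps shrink 6 -> 4 -> 2.
print(conv(conv(7, 3), 3), conv(conv(6, 3), 3))          # 3 2

# Feature-map counts: C2 has 2 x (5+5+5+4+4) = 46 maps, C4 has 6 x (3+3+3+2+2)
# = 78 maps, and each of the 128 C6 units connects to all 78 maps in S5.
print(2 * (5 + 5 + 5 + 4 + 4), 6 * (3 + 3 + 3 + 2 + 2))  # 46 78
```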

After the multiple layers of convolution and subsampling, the 7 input frames have been converted into a 128D feature vector capturing the motion information in the input frames. The output layer consists of the same number of units as the number of actions, and each unit is fully connected to each of the 128 units in the C6 layer. In this design we essentially apply a linear classifier on the 128D feature vector for action classification. All the trainable parameters in this model are initialized randomly and trained by the online error back-propagation algorithm as described in [17]. We have designed and evaluated other 3D CNN architectures that combine multiple channels of information at different stages, and our results show that this architecture gives the best performance.
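Since each output unit is fully connected to the 128 C6 units, the classification step above amounts to a linear map on the 128D vector; the sketch below uses random placeholder weights purely for illustration.

```python
import numpy as np

def output_layer(c6_vector, weights, biases):
    """Sketch of the output layer: each action unit is fully connected to the
    128 C6 units, so prediction is a linear classifier on the 128D vector."""
    return weights @ c6_vector + biases

rng = np.random.default_rng(0)
c6 = rng.standard_normal(128)             # 128D feature vector from C6
n_actions = 3                             # e.g., CellToEar, ObjectPut, Pointing
w = rng.standard_normal((n_actions, 128)) # placeholder trained weights
b = np.zeros(n_actions)
print(output_layer(c6, w, b).argmax())    # predicted action index
```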

    2.3 Model Regularization

The inputs to 3D CNN models are limited to a small number of contiguous video frames due to the increased number of trainable parameters as the size of the input window increases. On the other hand, many human actions span a number of frames. Hence, it is desirable to encode high-level motion information into the 3D CNN models. To this end, we propose to compute motion features from a large number of frames and regularize the 3D CNN models by using these motion features as auxiliary outputs (Figure 4). Similar ideas have been used in image classification tasks [35]-[37], but their performance in action recognition is not clear. In particular, for each training action we generate a feature vector encoding the long-term action information beyond the information contained in the input frame cube to the CNN. We then encourage the CNN to learn a feature vector close to this feature. This is achieved by connecting a number of auxiliary output units to the last hidden layer of the CNN and clamping the computed feature vectors on the auxiliary units during training. This


[Figure 3 layer sizes: input 7@60x40 -> hardwired H1: 33@60x40 -> 7x7x3 3D convolution C2: 23*2@54x34 -> 2x2 subsampling S3: 23*2@27x17 -> 7x6x3 3D convolution C4: 13*6@21x12 -> 3x3 subsampling S5: 13*6@7x4 -> 7x4 convolution C6: 128@1x1 -> full connection.]

Fig. 3. A 3D CNN architecture for human action recognition. This architecture consists of 1 hardwired layer, 3 convolution layers, 2 subsampling layers, and 1 full connection layer. Detailed descriptions are given in the text.

will encourage the hidden layer information to be close to the high-level motion feature. More details on this scheme can be found in [35]-[37]. In the experiments, we use the bag-of-words features constructed from dense SIFT descriptors [40] computed on raw gray images and motion edge history images (MEHI) [41] as auxiliary features. Results show that such a regularization scheme leads to consistent performance improvements.
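A sketch of the regularized objective described in this subsection: a standard classification loss plus a penalty pulling the auxiliary units toward the clamped long-term motion feature. The cross-entropy and squared-error forms are assumptions; the 0.005 weight is the value reported in Section 2.5.

```python
import numpy as np

def regularized_loss(class_scores, true_class, aux_outputs, aux_target, weight=0.005):
    """Classification loss plus a weighted auxiliary term that encourages the
    auxiliary output units to match the precomputed long-term motion feature
    (bag-of-words of SIFT/MEHI in the paper)."""
    # Classification term: softmax cross-entropy over action classes (assumed form).
    exp_scores = np.exp(class_scores - class_scores.max())
    log_probs = np.log(exp_scores / exp_scores.sum())
    class_loss = -log_probs[true_class]
    # Auxiliary term: squared error to the clamped feature vector (assumed form).
    aux_loss = 0.5 * np.sum((aux_outputs - aux_target) ** 2)
    return class_loss + weight * aux_loss

print(regularized_loss(np.array([1.0, 0.2, -0.5]), 0, np.zeros(4), np.ones(4)))
```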

    2.4 Model Combination

Based on the 3D convolution operations, a variety of 3D CNN architectures can be designed. Among the architectures considered in this paper, the one introduced in Section 2.2 yields the best performance on the TRECVID data. However, this may not be the case for other data sets. The selection of the optimal architecture for a problem is challenging, since this depends on the specific application. An alternative approach is to construct multiple models and combine the outputs of these models for making predictions [42]-[44]. This scheme has also been used in combining traditional neural networks [45]. However, the effect of model combination in the context of convolutional neural networks for action recognition has not been investigated. In this paper, we propose to construct multiple 3D CNN models with different architectures, hence capturing potentially complementary information from the inputs. In the prediction phase, the input is given to each model and the outputs of these models are then combined. Experimental results demonstrate that this model combination scheme

[Figure 4 blocks: 3D-CNN, action classes, auxiliary outputs, auxiliary feature extractors.]

Fig. 4. The regularized 3D CNN architecture.

is very effective in boosting the performance of 3D CNN models on action recognition tasks.
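A minimal sketch of the combination step: each architecture scores the same input and the scores are merged before the decision. The paper does not state the combination rule in this section, so simple averaging is assumed here.

```python
import numpy as np

def combine_predictions(model_outputs):
    """Average the per-class score vectors produced by several 3D CNN models
    for the same input cube (averaging is an assumption, not the paper's rule)."""
    return np.mean(np.stack(model_outputs, axis=0), axis=0)

# Three hypothetical models scoring 3 action classes for one input cube.
outputs = [np.array([0.7, 0.2, 0.1]),
           np.array([0.5, 0.3, 0.2]),
           np.array([0.6, 0.1, 0.3])]
scores = combine_predictions(outputs)
print(scores, scores.argmax())   # combined scores and predicted class index
```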

    2.5 Model Implementation

The 3D CNN models are implemented in C++ as part of NEC's human action recognition system [25]. The implementation details are based on those of the original CNN as described in [17], [46]. All the subsampling layers apply max sampling as described in [47]. The overall loss function used to train the regularized models is a weighted summation of the loss functions induced by the true action classes and the auxiliary outputs. The weight for the true action classes is set to 1 and that for the auxiliary outputs is set to 0.005 empirically. All the model parameters are randomly initialized as in [17], [46] and are trained using the stochastic diagonal Levenberg-Marquardt method [17], [46].


TABLE 1
The number of samples on the five dates extracted from the TRECVID 2008 development data set.

DATE \ CLASS   CELLTOEAR   OBJECTPUT   POINTING   NEGATIVE    TOTAL
20071101            2692        1349       7845      20056    31942
20071106            1820        3075       8533      22095    35523
20071107             465        3621       8708      19604    32398
20071108            4162        3582      11561      35898    55203
20071112            4859        5728      18480      51428    80495
TOTAL              13998       17355      55127     149081   235561

In this method, a learning rate is computed for each parameter using the diagonal terms of an estimate of the Gauss-Newton approximation to the Hessian matrix on 1000 randomly sampled training instances.
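A sketch of the per-parameter learning rates this implies; in the stochastic diagonal Levenberg-Marquardt method of [17], [46] the rate for parameter k has the form eta_k = eta / (mu + h_kk), where h_kk is the diagonal Hessian estimate. The particular eta and mu values below are illustrative, not from the paper.

```python
import numpy as np

def sdlm_learning_rates(hessian_diag, global_rate=0.01, mu=0.02):
    """Per-parameter rates eta / (mu + h_kk), where h_kk is the diagonal of a
    Gauss-Newton Hessian approximation estimated on sampled training instances
    (1000 instances in the paper)."""
    return global_rate / (mu + hessian_diag)

# Parameters in flat regions of the loss (small h_kk) get larger steps.
h = np.array([0.001, 0.1, 5.0])
print(sdlm_learning_rates(h))
```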

3 RELATED WORK

CNNs belong to the class of biologically inspired models for visual recognition, and some other variants have also been developed within this family. Motivated by the organization of the visual cortex, a similar model, called HMAX [48], has been developed for visual object recognition. In the HMAX model, a hierarchy of increasingly complex features is constructed by the alternating application of template matching and max pooling. In particular, at the S1 layer a still input image is first analyzed by an array of Gabor filters at multiple orientations and scales. The C1 layer is then obtained by pooling local neighborhoods on the S1 maps, leading to increased invariance to distortions on the input. The S2 maps are obtained by comparing C1 maps with an array of templates, which were generated randomly from C1 maps in the training phase. The final feature representation in C2 is obtained by performing global max pooling over each of the S2 maps.

The original HMAX model is designed to analyze 2D images. In [16] this model has been extended to recognize actions in video data. In particular, the Gabor filters in the S1 layer of the HMAX model have been replaced with gradient and space-time modules to capture motion information. In addition, some modifications to HMAX, proposed in [49], have been incorporated into the model. A major difference between CNN- and HMAX-based models is that CNNs are fully trainable systems in which all the parameters are adjusted based on training data, while all modules in HMAX consist of hard-coded parameters.

In speech and handwriting recognition, time-delay neural networks have been developed to extract temporal features [50]. In [51], a modified CNN architecture has been developed to extract features from video data. In addition to recognition tasks, CNNs have also been used in 3D image restoration problems [32].

4 EXPERIMENTS

We focus on the TRECVID 2008 data to evaluate the developed 3D CNN models for action recognition in surveillance videos. Meanwhile, we also perform experiments on the KTH data [13] to compare with previous methods.

    4.1 Action Recognition on TRECVID Data

The TRECVID 2008 development data set consists of 49-hour videos captured at the London Gatwick Airport using 5 different cameras with a resolution of 720x576 at 25 fps. The videos recorded by camera number 4 are excluded as few events occurred in this scene. In the current experiments, we focus on the recognition of 3 action classes (CellToEar, ObjectPut, and Pointing). Each action is classified in the one-against-rest manner, and a large number of negative samples were generated from actions that are not in these 3 classes. This data set was captured on five days (20071101, 20071106, 20071107, 20071108, and 20071112), and the statistics of the data used in our experiments are summarized in Table 1. Multiple 3D CNN models are evaluated in this experiment, including the one described in Figure 3.

As the videos were recorded in real-world environments and each frame contains multiple humans, we apply a human detector and a detection-driven tracker to locate human heads. The detailed procedure for tracking is described in [52], and some sample results are shown in Figure 5. Based on the detection and tracking results, a bounding box for each human that performs an action was


[Figure 7 panels: Precision, Recall, and scaled AUC at FPR = 0.1% and 1% for 3D-RCNN^s_332, 3D-CNN^s_332, 2D-CNN, SPM^cube_gray, SPM^cube_MEHI, SPM^cube_gray+MEHI, and TISR.]

Fig. 7. Average performance comparison of the seven methods under different false positive rates (FPR). The AUC scores at FPR=0.1% and 1% are multiplied by 10^5 and 10^3, respectively, for better visualization.

[Figure 8 panels: Precision, Recall, and scaled AUC at FPR = 0.1% and 1% for 3D-CNN^s_332, 3D-CNN^m_332, 3D-CNN^m_322, and 3D-CNN^m_222.]

Fig. 8. Average performance comparison of the four different 3D CNN architectures under different false positive rates (FPR). The AUC scores at FPR=0.1% and 1% are multiplied by 10^5 and 10^3, respectively, for better visualization.

in all cases. Although the improvement by the regularized model is not significant, the following experiments show that significant performance improvements can be obtained by combining the two models.

To evaluate the effectiveness of model combination in the context of CNNs for action recognition, we develop three alternative 3D CNN architectures, described below.

3D-CNN^m_332 denotes the architecture in which the different channels are mixed, the first two convolutional layers use 3D convolution, and the last layer uses 2D convolution. "Mixed" means that the channels of the same type (i.e., gradient-x and gradient-y, optflow-x and optflow-y) are convolved separately, but they are connected to the same set of feature planes in the first convolutional layer. In the second convolutional layer, all five channels are connected to the same set of feature planes. In contrast, for models with superscript s, all five channels are connected to separate feature planes in all layers.

3D-CNN^m_322 denotes a model similar to 3D-CNN^m_332, but only the first convolutional layer uses 3D convolution, and the other two layers use 2D convolution.

3D-CNN^m_222 denotes a model similar to 3D-CNN^m_332, but all three convolutional layers use 2D convolution.

The average performance of these three models along with that of 3D-CNN^s_332 is plotted in Figure 8. We can observe that the performance of these three alternative architectures is lower than that of 3D-CNN^s_332. However, we show in the following that combination of these models can lead to significant performance improvement.

To evaluate the effectiveness of model combination, we tuned each of the five models (3D-RCNN^s_332, 3D-CNN^s_332, 3D-CNN^m_222, 3D-CNN^m_322, and 3D-CNN^m_332) individually and then combined their outputs to make predictions. We combine models incrementally in order of decreasing individual performance. That is, the models are sorted in a decreasing order of individual performance and they are combined incrementally from the first to the last. The reason for doing this is that we expect individual


[Figure 9 panels: Precision, Recall, and scaled AUC at FPR = 0.1% and 1% for the combinations 1, 1+2, 1+2+3, 1+2+3+4, and 1+2+3+4+5.]

Fig. 9. Performance of different combinations of the 3D CNN architectures. The AUC scores at FPR=0.1% and 1% are multiplied by 10^5 and 10^3, respectively, for better visualization. See the caption of Table 3 and the text for detailed explanations.
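A sketch of the incremental combination protocol described above, whose results Figure 9 reports: the models are sorted by individual performance and nested combinations are formed from the best downward. The scores in the example are hypothetical.

```python
def incremental_combinations(models_by_score):
    """Sort models by individual performance (best first) and form the nested
    combinations 1, 1+2, 1+2+3, ... evaluated in Figure 9."""
    ordered = sorted(models_by_score, key=models_by_score.get, reverse=True)
    return [ordered[:k] for k in range(1, len(ordered) + 1)]

# Hypothetical individual scores for the five architectures.
scores = {"3D-RCNN^s_332": 0.90, "3D-CNN^s_332": 0.88,
          "3D-CNN^m_332": 0.85, "3D-CNN^m_322": 0.83, "3D-CNN^m_222": 0.80}
for combo in incremental_combinations(scores):
    print(" + ".join(combo))
```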

TABLE 4
Comparison of the best performance achieved by our previous methods in [26] (ICML models) with the new methods proposed in this paper (New models).

                Precision          Recall             AUC
FPR             0.1%     1%        0.1%     1%        0.1%     1%
New models      0.7824   0.6149    0.0340   0.1433    0.0201   0.8853
ICML models     0.7137   0.5572    0.0230   0.1132    0.0129   0.6752

converted into 128D feature vectors. The final layer consists of 6 units corresponding to the 6 classes.

As in [16], we use the data for 16 randomly selected subjects for training, and the data for the other 9 subjects for testing. Majority voting is used to produce labels for a video sequence based on the predictions for individual frames. The recognition performance averaged across 5 random trials is reported in Table 5 along with published results in the literature. The 3D CNN model achieves an overall accuracy of 90.2% as compared with 91.7% achieved by the HMAX model. Note that the HMAX model uses handcrafted features computed from raw images with 4-fold higher resolution. Also, some of the methods in Table 5 used different training/test splits of the data.
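A minimal sketch of the majority-voting step used to turn per-frame predictions into a sequence label; the class names are illustrative.

```python
from collections import Counter

def video_label(frame_predictions):
    """Label a video sequence with the action class predicted most often
    across its individual frames."""
    return Counter(frame_predictions).most_common(1)[0][0]

print(video_label(["walking", "walking", "jogging", "walking", "running"]))
```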

5 CONCLUSIONS AND DISCUSSIONS

We developed 3D CNN models for action recognition in this paper. These models construct features from both spatial and temporal dimensions by performing 3D convolutions. The developed deep architecture generates multiple channels of information from adjacent input frames and performs convolution and subsampling separately in each channel. The final feature representation is obtained by combining information from all channels. We developed model regularization and combination schemes to further boost the model performance. We evaluated the 3D CNN models on the TRECVID and the KTH data sets. Results show that the 3D CNN model outperforms the compared methods on the TRECVID data, while it achieves competitive performance on the KTH data, demonstrating its superior performance in real-world environments.

In this work, we considered the CNN model for action recognition. There are also other deep architectures, such as the deep belief networks [19], [23], which achieve promising performance on object recognition tasks. It would be interesting to extend such models for action recognition. The developed 3D CNN model was trained using a supervised algorithm in this work, and it requires a large number of labeled samples. Prior studies show that the number of labeled samples can be significantly reduced when such a model is pre-trained using unsupervised algorithms [22]. We will explore the unsupervised training of 3D CNN models in the future.

REFERENCES

[1] I. Laptev and T. Lindeberg, "Space-time interest points," in Proceedings of the IEEE International Conference on Computer Vision, 2003, pp. 432-439.

[2] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, "Learning realistic human actions from movies," in Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008.


Fig. 10. Sample actions in the CellToEar class. The top row shows actions that are correctly recognized by the combined 3D CNN model while the bottom row shows those that are misclassified by the model.

Fig. 11. Sample actions in the ObjectPut class. The top row shows actions that are correctly recognized by the combined 3D CNN model while the bottom row shows those that are misclassified by the model.

Fig. 12. Sample actions in the Pointing class. The top row shows actions that are correctly recognized by the combined 3D CNN model while the bottom row shows those that are misclassified by the model.

[3] J. Liu, J. Luo, and M. Shah, "Recognizing realistic actions from videos," in Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 1996-2003.

[4] Y. Wang and G. Mori, "Max-margin hidden conditional random fields for human action recognition," in Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 872-879.

[5] O. Duchenne, I. Laptev, J. Sivic, F. Bach, and J. Ponce, "Automatic annotation of human actions in video," in Proceedings of the 12th International Conference on Computer Vision, 2009, pp. 1491-1498.

[6] Y. Wang and G. Mori, "Hidden part models for human action recognition: Probabilistic versus max margin," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.

[7] H. Wang, M. M. Ullah, A. Kläser, I. Laptev, and C. Schmid, "Evaluation of local spatio-temporal features for action recognition," in British Machine Vision Conference, 2009, p. 127.

[8] M. Marszalek, I. Laptev, and C. Schmid, "Actions in context," in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 2929-2936.

[9] I. Junejo, E. Dexter, I. Laptev, and P. Pérez, "View-independent action recognition from temporal self-similarities," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 172-185, 2011.

[10] V. Delaitre, I. Laptev, and J. Sivic, "Recognizing human actions in still images: a study of bag-of-features and part-based representations," in Proceedings of the 21st British Machine Vision Conference, 2010.

[11] Q. Le, W. Zou, S. Yeung, and A. Ng, "Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis," in Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 3361-3368.

[12] A. A. Efros, A. C. Berg, G. Mori, and J. Malik, "Recognizing action at a distance," in Proceedings of the Ninth IEEE International Conference on Computer Vision, 2003, pp. 726-733.

[13] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: A local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition, 2004, pp. 32-36.

[14] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in


TABLE 5
Action recognition accuracies in percentage on the KTH data. Note that we use the same training/test split as [16], and other methods use different splits.

Method           Boxing   Handclapping   Handwaving   Jogging   Running   Walking   Average
3D CNN             90          94            97          84        79        97       90.2
Schuldt [13]       97.9        59.7          73.6        60.4      54.9      83.8     71.7
Dollar [14]        93          77            85          57        85        90       81.2
Niebles [56]       98          86            93          53        88        82       83.3
Jhuang [16]        92          98            92          85        87        96       91.7
Schindler [53]      -           -             -           -         -         -       92.7

Second Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005, pp. 65-72.

[15] I. Laptev and P. Pérez, "Retrieving actions in movies," in Proceedings of the 11th International Conference on Computer Vision, 2007, pp. 1-8.

[16] H. Jhuang, T. Serre, L. Wolf, and T. Poggio, "A biologically inspired system for action recognition," in Proceedings of the 11th International Conference on Computer Vision, 2007, pp. 1-8.

[17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, November 1998.

[18] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504-507, July 2006.

[19] G. E. Hinton, S. Osindero, and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, pp. 1527-1554, 2006.

[20] Y. Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1-127, 2009.

[21] Y. Bengio and Y. LeCun, "Scaling learning algorithms towards AI," in Large-Scale Kernel Machines, L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, Eds. MIT Press, 2007.

[22] M. Ranzato, F. Huang, Y. Boureau, and Y. LeCun, "Unsupervised learning of invariant feature hierarchies with applications to object recognition," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2007.

[23] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, "Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations," in Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 609-616.

[24] M. Norouzi, M. Ranjbar, and G. Mori, "Stacks of convolutional restricted Boltzmann machines for shift-invariant feature learning," in Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009.

[25] M. Yang, S. Ji, W. Xu, J. Wang, F. Lv, K. Yu, Y. Gong, M. Dikmen, D. J. Lin, and T. S. Huang, "Detecting human actions in surveillance videos," in TREC Video Retrieval Evaluation Workshop, 2009.

[26] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," in Proceedings of the Twenty-Seventh International Conference on Machine Learning, 2010, pp. 495-502.

[27] G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler, "Convolutional learning of spatio-temporal features," in Proceedings of the 11th European Conference on Computer Vision, 2010, pp. 140-153.

[28] R. Collobert and J. Weston, "A unified architecture for natural language processing: deep neural networks with multitask learning," in Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 160-167.

[29] H. Lee, P. Pham, Y. Largman, and A. Ng, "Unsupervised feature learning for audio classification using convolutional deep belief networks," in Advances in Neural Information Processing Systems 22, 2009, pp. 1096-1104.

[30] H. Cecotti and A. Graser, "Convolutional neural networks for P300 detection with application to brain-computer interfaces," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, pp. 433-445, 2011.

[31] J. Fan, W. Xu, Y. Wu, and Y. Gong, "Human tracking using convolutional neural networks," IEEE Transactions on Neural Networks, vol. 21, pp. 1610-1623, October 2010.

[32] V. Jain, J. F. Murray, F. Roth, S. Turaga, V. Zhigulin, K. L. Briggman, M. N. Helmstaedter, W. Denk, and H. S. Seung, "Supervised learning of image restoration with convolutional networks," in Proceedings of the 11th International Conference on Computer Vision, 2007.

[33] V. Jain and S. Seung, "Natural image denoising with convolutional networks," in Advances in Neural Information Processing Systems 21, D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, Eds., 2009, pp. 769-776.

[34] S. C. Turaga, J. F. Murray, V. Jain, F. Roth, M. Helmstaedter, K. Briggman, W. Denk, and H. S. Seung, "Convolutional networks can learn to generate affinity graphs for image segmentation," Neural Computation, vol. 22, no. 2, pp. 511-538, 2010.

[35] A. Ahmed, K. Yu, W. Xu, Y. Gong, and E. Xing, "Training hierarchical feed-forward visual recognition models using transfer learning from pseudo-tasks," in Proceedings of the 10th European Conference on Computer Vision, 2008, pp. 69-82.

[36] K. Yu, W. Xu, and Y. Gong, "Deep learning with kernel regularization for visual recognition," in Advances in Neural Information Processing Systems 21, D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, Eds., 2009, pp. 1889-1896.

[37] H. Mobahi, R. Collobert, and J. Weston, "Deep learning from temporal coherence in video," in Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 737-744.

[38] Y. LeCun, F. Huang, and L. Bottou, "Learning methods for generic object recognition with invariance to pose and lighting," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004.

[39] F. Ning, D. Delhomme, Y. LeCun, F. Piano, L. Bottou, and P. Barbano, "Toward automatic phenotyping of developing embryos from videos," IEEE Transactions on Image Processing, vol. 14, no. 9, pp. 1360-1371, 2005.

[40] D. G. Lowe, "Distinctive image features from scale invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.


[41] M. Yang, F. Lv, W. Xu, K. Yu, and Y. Gong, "Human action detection by boosting efficient motion features," in IEEE Workshop on Video-oriented Object and Event Classification, 2009.

[42] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, pp. 123-140, 1996.

[43] Y. Freund and R. E. Schapire, "Experiments with a new boosting algorithm," in Proceedings of the Thirteenth International Conference on Machine Learning, 1996, pp. 148-156.

[44] J. Kittler, M. Hatef, R. P. Duin, and J. Matas, "On combining classifiers," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, pp. 226-239, 1998.

[45] L. K. Hansen and P. Salamon, "Neural network ensembles," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, pp. 993-1001, October 1990.

[46] Y. LeCun, L. Bottou, G. Orr, and K. Muller, "Efficient backprop," in Neural Networks: Tricks of the Trade, G. Orr and K. Muller, Eds. Springer, 1998.

[47] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, "What is the best multi-stage architecture for object recognition?" in Proceedings of the 12th International Conference on Computer Vision. IEEE, 2009.

[48] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio, "Object recognition with cortex-like mechanisms," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 3, pp. 411-426, 2007.

[49] J. Mutch and D. G. Lowe, "Object class recognition and localization using sparse features with limited receptive fields," International Journal of Computer Vision, vol. 80, no. 1, pp. 45-57, October 2008.

[50] J. Bromley, I. Guyon, Y. LeCun, E. Sackinger, and R. Shah, "Signature verification using a Siamese time delay neural network," in Advances in Neural Information Processing Systems 6, 1994, pp. 737-744.

[51] H.-J. Kim, J. S. Lee, and H.-S. Yang, "Human action recognition using a modified convolutional neural network," in Proceedings of the 4th International Symposium on Neural Networks, 2007, pp. 715-723.

[52] M. Yang, F. Lv, W. Xu, and Y. Gong, "Detection driven adaptive multi-cue integration for multiple human tracking," in Proceedings of the 12th International Conference on Computer Vision, 2009, pp. 1554-1561.

[53] K. Schindler and L. Van Gool, "Action snippets: How many frames does human action recognition require?" in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 2008.

[54] G. Zhu, M. Yang, K. Yu, W. Xu, and Y. Gong, "Detecting video events based on action recognition in complex scenes using spatio-temporal descriptor," in Proceedings of the 17th ACM International Conference on Multimedia, 2009, pp. 165-174.

[55] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2006, pp. 2169-2178.

[56] J. C. Niebles, H. Wang, and L. Fei-Fei, "Unsupervised learning of human action categories using spatial-temporal words," International Journal of Computer Vision, vol. 79, no. 3, pp. 299-318, 2008.

Shuiwang Ji received the Ph.D. degree in computer science from Arizona State University, Tempe, AZ, in 2010. Currently, he is an Assistant Professor in the Department of Computer Science, Old Dominion University, Norfolk, VA. His research interests include machine learning, data mining, and bioinformatics. He received the Outstanding Ph.D. Student Award from Arizona State University in 2010.

Wei Xu received his B.S. degree from Tsinghua University, Beijing, China, in 1998, and M.S. degree from Carnegie Mellon University (CMU), Pittsburgh, USA, in 2000. From 1998 to 2001, he was a research assistant at the Language Technology Institute at CMU. In 2001, he joined NEC Laboratories America working on intelligent video analysis. He has been a research scientist at Facebook since November 2009. His research interests include computer vision, image and video understanding, machine learning, and data mining.

Ming Yang received the B.E. and M.E. degrees in electronic engineering from Tsinghua University, Beijing, China, in 2001 and 2004, respectively, and the Ph.D. degree in electrical and computer engineering from Northwestern University, Evanston, Illinois, in June 2008. From 2004 to 2008, he was a research assistant in the computer vision group of Northwestern University. After his graduation, he joined NEC Laboratories America, Cupertino, California, where he is currently a research staff member. His research interests include computer vision, machine learning, video communication, large-scale image retrieval, and intelligent multimedia content analysis. He is a member of the IEEE.

Kai Yu is the head of the media analytics department at NEC Laboratories America, where he manages an R&D division responsible for NEC's technology innovation in image recognition, multimedia search, video surveillance, sensor mining, and human-computer interaction. He has published over 70 papers in top-tier conferences and journals in the areas of machine learning, data mining, and computer vision. He has served as an area chair in top-tier machine learning conferences, e.g., ICML and NIPS, and taught an AI class at Stanford University as a visiting faculty member. Before joining NEC, he was a senior research scientist at Siemens. He received his Ph.D. in computer science from the University of Munich in 2004.