Hierarchical Temporal Pooling for Efficient Online Action Recognition

Can Zhang1, Yuexian Zou1,2(&), and Guang Chen1

1 ADSPLAB, School of ECE, Peking University, Shenzhen, China
[email protected]

2 Peng Cheng Laboratory, Shenzhen, China

Abstract. Action recognition in videos is a difficult and challenging task. Recently developed deep learning-based action recognition methods have achieved state-of-the-art performance on several action recognition benchmarks. However, these methods are inefficient: they have large model sizes and long runtimes, which restricts their practical applications. In this study, we focus on improving the accuracy and efficiency of action recognition following the two-stream ConvNets by investigating effective video-level representations. Our motivation stems from the observation that redundant information widely exists in adjacent frames of a video, and that humans do not recognize actions based on frame-level features. Therefore, to extract effective video-level features, a Hierarchical Temporal Pooling (HTP) module is proposed and a two-stream action recognition network termed HTP-Net (Two-stream) is developed, which is carefully designed to obtain effective video-level representations by hierarchically incorporating temporal motion and spatial appearance features. It is worth noting that all two-stream action recognition methods using optical flow as one of the inputs are computationally inefficient, since calculating optical flow is time-consuming. To improve efficiency, in our study we do not use optical flow but take only raw RGB frames as input to our HTP-Net, termed HTP-Net (RGB) for a clear and concise presentation. Extensive experiments have been conducted on two benchmarks: UCF101 and HMDB51. Experimental results demonstrate that HTP-Net (Two-stream) achieves state-of-the-art performance, while HTP-Net (RGB) offers competitive action recognition accuracy but is approximately 1-2 orders of magnitude faster than other state-of-the-art single-stream action recognition methods. Specifically, our HTP-Net (RGB) runs at 42 videos per second (vps) and 672 frames per second (fps) on an NVIDIA Titan X GPU, which enables real-time action recognition and is of great value in practical applications.

Keywords: Action recognition · Hierarchical Temporal Pooling · Real-time

1 Introduction

Recently, action recognition in videos has become a challenging and fundamental problem in the computer vision research area, with potential applications in many areas such as intelligent life assistance and video surveillance analysis.

Research shows that Convolutional Neural Networks (CNNs) are key players in image and video processing; representatives include AlexNet [1], VGG [2], ResNet [3], and GoogLeNet [4]. So far, extracting reliable spatio-temporal features with CNNs is still an active research topic.

CNN-based architectures for action recognition can be divided into two major categories: (1) Two-stream ConvNets: this approach decomposes the input video into spatial and temporal streams, which learn appearance and motion features respectively. Each stream is trained separately, and various fusion strategies such as consensus pooling [5], 3D pooling [6], or trajectory-constrained pooling [7] are applied to fuse the outputs at the end, aiming to learn spatio-temporal features. Under this framework, Simonyan et al. devised two-stream ConvNets [6] by processing RGB images in the spatial stream and stacked optical flow in the temporal stream. Wang et al. proposed the Temporal Segment Network (TSN) [5] to make two-stream ConvNets go deeper and to explore cross-modality pre-training. TSN greatly outperforms previous traditional methods such as improved Dense Trajectories (iDT) [8]. (2) 3D ConvNets: this approach operates on long-term input video frames and can aggregate not only the spatial appearance information in each frame but also the temporal transformation across neighboring frames. 3D ConvNets using convolutions in the time dimension were first introduced by Baccouche et al. [9] and Ji et al. [10]. Later, Tran et al. applied 3D ConvNets [11] to large-scale datasets and further integrated deep ResNets with 3D convolutions, which is called Res3D [12].

Although two-stream ConvNets and 3D ConvNets have achieved great success in recognizing actions in unconstrained scenes, their results are far from meeting the needs of practical applications. From the perspective of accuracy, two-stream ConvNets are generally superior to 3D ConvNets because ultra-deep networks and pre-trained models from the large-scale image classification task can be easily applied. Nevertheless, calculating optical flow in advance and training two separate streams are time-consuming. To address this problem, 3D ConvNets encode the spatio-temporal information by extending the convolution and pooling operations from 2D to 3D. However, the training process of 3D ConvNets is more computationally expensive and the model size is larger compared with two-stream ConvNets. For example, the model size of the 33-layer 2D BN-Inception [13] is 39 MB, while the model size of the widely used 11-layer 3D ConvNet (C3D) is 321 MB, about 8 times larger. This "fatal flaw" cannot be ignored when efficiency is a concern.

Our objective in this paper is to improve the efficiency of action recognition by inserting the proposed Hierarchical Temporal Pooling (HTP) module into two-stream ConvNets, which leads to an extremely efficient action recognition network termed HTP-Net (Two-stream). Empirically, calculating optical flow is time-consuming for two-stream ConvNets, so following common practice we evaluate the efficiency of our HTP-Net with only raw RGB frames as input, namely HTP-Net (RGB). Through experiments conducted on two benchmarks, we demonstrate that HTP-Net (Two-stream) achieves state-of-the-art performance and that HTP-Net (RGB) offers competitive action recognition accuracy while being approximately 1-2 orders of magnitude faster than other state-of-the-art single-stream action recognition methods, which fits real-time action recognition applications well.

2 Proposed HTP-Net

2.1 Overall Architecture

The network architecture of our proposed HTP-Net is shown in Fig. 1.

Sampling Strategy. Given an input video V consisting of a variable number of frames, and considering the redundancy in consecutive frames and the limitation of memory size, we split the entire video into N subsections $\{S_1, S_2, \dots, S_N\}$ of equal duration. From each subsection, one frame is randomly chosen; the selected N frames are denoted as $F_i$ $(i = 1, 2, \dots, N)$. This sampling strategy has been proved effective by many state-of-the-art methods [5, 14–16]; it not only allows the whole video to be processed at reasonable computational cost, but also brings appearance diversity due to the stochastic selection.
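A minimal sketch of this segment-based sampling is given below, assuming frames are indexed from 0 to T-1 and that the video has at least N frames; the helper name `sample_frame_indices` and the use of Python's standard `random` module are illustrative choices, not part of the original implementation.

```python
import random

def sample_frame_indices(num_video_frames, num_segments=16, seed=None):
    """Split the video into `num_segments` equal-duration subsections and
    randomly pick one frame index from each (segment-based sampling)."""
    rng = random.Random(seed)
    boundaries = [round(k * num_video_frames / num_segments)
                  for k in range(num_segments + 1)]
    indices = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        end = max(end, start + 1)  # guard against empty segments for short clips
        indices.append(rng.randrange(start, min(end, num_video_frames)))
    return indices

# Example: a 300-frame clip yields 16 indices, one per subsection.
print(sample_frame_indices(300, 16, seed=0))
```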

Network Architecture. As shown in Fig. 1, the green 3D HTP modules are sandwiched between the clusters of blue 2D blocks (weight sharing). The 2D blocks in different layers represent different parts of the 2D ConvNet. For instance, we divide the BN-Inception architecture [13] into five parts (the partition points are the conv2, inception-3c, inception-4e and inception-5b layers); detailed explanations are given in Sect. 3.1. In Fig. 1, the first column of 2D blocks is denoted as "2D blocks_1", which indicates the first part of the BN-Inception architecture (up to the conv2 layer), and so on. Our proposed HTP module is used to merge the temporal motion and spatial appearance information simultaneously. In our design, temporal downsampling is performed at each HTP module with a 3D pooling operation; hence no additional parameters are required, which keeps our model rather small.

Fig. 1. The architecture of HTP-Net. The entire video is split into N subsections of equal length, denoted as $S_1, \dots, S_N$. One frame is randomly sampled from each subsection. These sampled frames are processed by the clusters of 2D blocks (in blue) and 3D HTP modules (in green). The 2D blocks are applied to yield spatial appearance representations, and the 3D HTP module is used to merge the temporal motion and spatial appearance information simultaneously. (Color figure online)

As mentioned above, N frames $\{F_1, F_2, \dots, F_N\}$ are randomly extracted from the video. Each of these frames is fed into the 2D blocks respectively, and the 2D blocks compute the appearance features containing spatial information for each frame independently. Feature maps obtained from the "2D blocks_1" can be denoted as $\{X_1^{(1)}, X_2^{(1)}, \dots, X_N^{(1)}\}$. Before and after 3D pooling, the feature vectors are permuted in the HTP modules; the permutation details are elaborated in Sect. 2.2. In the HTP module, the spatiotemporal features are acquired by performing a temporal pooling operation between neighboring frames, and the features after the first HTP module (denoted as "HTP_1" in Fig. 1) are denoted as $\{Y_1^{(1)}, Y_2^{(1)}, \dots, Y_{N/2}^{(1)}\}$.

Assume the total number of HTP modules is M; then the output features $\mathcal{F}^{(j)}$ of the j-th HTP module (HTP_j) are denoted as:

$$\mathcal{F}^{(j)} = \left\{Y_1^{(j)}, Y_2^{(j)}, \dots, Y_{N/2^j}^{(j)}\right\} \quad (1)$$

where $Y_i^{(j)} \in \mathbb{R}^d$, $i = 1, 2, \dots, N/2^j$, $j = 1, 2, \dots, M$, and d is the dimension of the output features. To get a video-level spatiotemporal representation, we let $M = \log_2 N$, so that the last HTP module (HTP_M) outputs only a single feature $Y_1^{(M)}$, which contains adequate information for video feature learning; its superior performance is shown later in Sect. 3.

The proposed network architecture is interpretable and efficient, and the HTP-Net can easily be trained end-to-end. Instead of predicting action classes by aggregating numerous frame-level predictions, our approach only processes N frames to obtain video-level features at runtime, which makes it capable of inferring the action in a video at first sight, without "hesitation".
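To make the hierarchy concrete, the following is a minimal PyTorch sketch (not the authors' released code) of how weight-shared 2D blocks and HTP steps could be interleaved so that N frames collapse to a single video-level feature after $M = \log_2 N$ stages. It uses 3D max pooling with a temporal stride of 2 for the HTP step (detailed in Sect. 2.2), ignores the extra spatial strides of Table 1, and assumes one 2D block per stage; the class name `HTPNetSketch` and the `feat_dim` argument are illustrative.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class HTPNetSketch(nn.Module):
    """Illustrative skeleton: each stage runs a weight-shared 2D block on every
    remaining frame, then an HTP step halves the time axis, so N frames collapse
    to one video-level feature after M = log2(N) stages."""

    def __init__(self, blocks_2d, feat_dim, num_classes, num_frames=16):
        super().__init__()
        assert num_frames & (num_frames - 1) == 0, "N is assumed to be a power of two"
        assert len(blocks_2d) == int(math.log2(num_frames)), "one 2D block per HTP stage"
        self.blocks_2d = nn.ModuleList(blocks_2d)
        self.fc = nn.Linear(feat_dim, num_classes)  # feat_dim = channels of the last block

    def forward(self, frames):                       # frames: (B, N, C, H, W)
        x = frames
        for block in self.blocks_2d:
            b, n, c, h, w = x.shape
            x = block(x.reshape(b * n, c, h, w))     # shared 2D block runs on each frame
            c, h, w = x.shape[1:]
            x = x.reshape(b, n, c, h, w).permute(0, 2, 1, 3, 4)            # temporal stacking
            x = F.max_pool3d(x, kernel_size=(2, 1, 1), stride=(2, 1, 1))   # HTP: halve the time span
            x = x.permute(0, 2, 1, 3, 4)                                   # spatial stacking
        x = x.squeeze(1)                             # time span is now 1 -> (B, C, H, W)
        x = F.adaptive_avg_pool2d(x, 1).flatten(1)   # 2D global average pooling
        return self.fc(x)


# Hypothetical usage with tiny dummy 2D blocks:
# blocks = [nn.Conv2d(3, 8, 3, padding=1)] + [nn.Conv2d(8, 8, 3, padding=1) for _ in range(3)]
# logits = HTPNetSketch(blocks, feat_dim=8, num_classes=101)(torch.randn(2, 16, 3, 224, 224))
```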

2.2 Proposed HTP Module

Recently, for the image classification task, the rise of 2D ConvNets has convincingly demonstrated their strong ability to capture useful static information, especially in the spatial domain. The motivation of our HTP module is to use 2D ConvNets to encode appearance effectively in individual frames, and to integrate temporal correlation into the spatial representation by performing 3D pooling along the time dimension.

As shown in Fig. 2, the HTP module contains two operations: dimension permutation and 3D temporal pooling. Note that "dimension permutation" can be divided into "temporal stacking" and "spatial stacking". Details are elaborated below.

Dimension Permutation. 2D ConvNets process 3D tensors of size $C \times W \times H$, where C denotes the number of channels, and W and H denote the width and height in the spatial dimensions respectively; this means the temporal relation within the video frames is ignored and the N frames are treated similarly to channels. In contrast, 3D ConvNets operate on 4D tensors of size $C \times T \times W \times H$, where T is the time span. Therefore, dimension permutation, consisting of temporal stacking and spatial stacking, is essential because 2D and 3D operations alternate in our HTP-Net.

Temporal Stacking aims to provide correct input volumes for 3D pooling by stacking the feature maps in time order, so the tensors' dimensions are transformed from 3D to 4D. Figure 2 shows the details of the first HTP module as an example. Specifically, the "2D Blocks_1" (weight sharing) receive N frames as input and produce N feature representations, each of which can be expressed as a volume $X_i \in \mathbb{R}^{K \times W \times H}$ $(i = 1, 2, \dots, N)$, where K denotes the number of convolutional filters applied at the end of the blocks. $X_i$ consists of K feature maps of equal size $P_j^{(i)} \in \mathbb{R}^{W \times H}$ $(j = 1, 2, \dots, K)$, so each feature volume obtained from the 2D blocks can be represented as:

$$X_i = \left\{P_1^{(i)}, P_2^{(i)}, \dots, P_K^{(i)}\right\} \quad (2)$$

The feature map $P_j^{(i)}$ is the basic unit during the permutation procedure. Feature maps with the same j index are stacked in ascending order of the i index. In other words, every feature map in the original volumes $X_i$ $(i = 1, 2, \dots, N)$ is evenly assigned to K permuted volumes $X'_m$, where $X'_m \in \mathbb{R}^{N \times W \times H}$ $(m = 1, 2, \dots, K)$. For example, the first permuted volume $X'_1$ contains a total of N feature maps $P_1^{(i)}$ $(i = 1, 2, \dots, N)$ with the same j index (in this case $j = 1$), and each feature map is sorted chronologically. Hence, after permutation, $X'_1 = \left[P_1^{(1)}, P_1^{(2)}, \dots, P_1^{(N)}\right]$. Therefore, all the permuted features can be derived as follows:

$$X'_m = \left\{P_m^{(1)}, P_m^{(2)}, \dots, P_m^{(N)}\right\} \quad (3)$$

In conclusion, the input volumes of the HTP module are N 3D tensors of size $K \times W \times H$, which cannot be directly processed by 3D pooling due to the dimension mismatch. After temporal stacking, the number of extracted frames N, which represents the time span of the input video, is transposed to the time dimension: the N 3D tensors of size $K \times W \times H$ are permuted into a single 4D tensor of size $K \times N \times W \times H$, which 3D pooling can handle properly.
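As an illustration only (not taken from the paper's code), temporal stacking amounts to a single axis permutation when the N per-frame volumes are held along a leading frame axis; the variable names below are hypothetical.

```python
import torch

N, K, W, H = 16, 192, 28, 28
frame_features = torch.randn(N, K, W, H)      # N per-frame volumes X_i of size K x W x H

# Temporal stacking: move the frame axis behind the channel axis so that feature
# maps sharing the same channel index j are ordered chronologically along dim 1.
stacked = frame_features.permute(1, 0, 2, 3)  # 4D tensor of size K x N x W x H
assert stacked.shape == (K, N, W, H)
```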

Spatial Stacking aims to provide correct input volumes for the following 2D blocks by stacking the feature maps in spatial order, so the tensors' dimensions are transformed from 4D back to 3D. To some extent, spatial stacking is the inverse transformation of temporal stacking. Note that temporal downsampling is performed in the HTP module, so the time span changes from N to N' (in this case $N' = N/2$). Similarly, the pooled features can be expressed as volumes $Y'_m \in \mathbb{R}^{N' \times W \times H}$ $(m = 1, 2, \dots, K)$:

$$Y'_m = \left\{Q_m^{(1)}, Q_m^{(2)}, \dots, Q_m^{(N')}\right\} \quad (4)$$

where each $Q_j^{(i')} \in \mathbb{R}^{W \times H}$ $(i' = 1, 2, \dots, N';\ j = 1, 2, \dots, K)$ represents a feature map obtained after pooling. To be further processed by the upcoming 2D blocks, the feature maps have to be re-stacked spatially.

Similar to the pattern of temporal stacking, feature maps with the same $i'$ index are stacked in ascending order of the j index, as indicated below:

$$Y_{i'} = \left\{Q_1^{(i')}, Q_2^{(i')}, \dots, Q_K^{(i')}\right\} \quad (5)$$

After spatial stacking, the feature volumes are restored to N' tensors of size $K \times W \times H$, so that the following 2D ConvNets can operate on the tensors correctly.
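Continuing the illustrative snippet above, and assuming a 3D max pooling with a temporal window of 2 for the HTP step, spatial stacking is simply the inverse permutation; again, the variable names are hypothetical.

```python
import torch
import torch.nn.functional as F

K, N, W, H = 192, 16, 28, 28
stacked = torch.randn(K, N, W, H)          # result of temporal stacking: K x N x W x H

# HTP step: 3D max pooling over the time axis only (window 2, stride 2), so N -> N/2.
pooled = F.max_pool3d(stacked.unsqueeze(0), kernel_size=(2, 1, 1), stride=(2, 1, 1)).squeeze(0)

# Spatial stacking: inverse permutation, restoring N/2 per-frame volumes of size
# K x W x H so the next weight-shared 2D blocks can process them frame by frame.
restored = pooled.permute(1, 0, 2, 3)      # shape (N // 2, K, W, H)
assert restored.shape == (N // 2, K, W, H)
```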

3D Temporal Pooling. This is based on two observations: (1) redundant information exists widely in consecutively sampled frames, and (2) 2D convolution can encode spatial information effectively in individual frames. We therefore find it essential to pool across the time dimension to merge the temporal and spatial information simultaneously. As mentioned above, given the input feature maps $P_j^{(i)}$ $(i = 1, 2, \dots, N;\ j = 1, 2, \dots, K)$, the feature maps obtained after the 3D pooling operation are $Q_j^{(i')}$ $(i' = 1, 2, \dots, N';\ j = 1, 2, \dots, K)$. In each HTP module, let j equal a specific value $j_0$; the response of each pooling layer is then obtained by a function

$$\mathcal{H}: \left\{P_{j_0}^{(m)}, P_{j_0}^{(m+1)}, \dots, P_{j_0}^{(n)}\right\} \rightarrow Q_{j_0}^{(m \rightarrow n)} \quad (6)$$

where $m, n \in \{1, 2, \dots, N\}$ and $m < n$.

Fig. 2. Details of our HTP module. The HTP module contains two operations: dimension permutation and 3D temporal pooling. Note that "dimension permutation" includes "temporal stacking" and "spatial stacking". Before and after 3D temporal pooling, the series of feature maps is permuted within the HTP module. In temporal order, the process can be summarized as three steps: (1) temporal stacking; (2) 3D temporal pooling; (3) spatial stacking.

Obviously, different pooling functions can be used. For clarity of presentation, some commonly used pooling functions $\mathcal{H}$ are given below.

• Average pooling:

$$Q_{j_0}^{(m \rightarrow n)} = \left(P_{j_0}^{(m)} + P_{j_0}^{(m+1)} + \dots + P_{j_0}^{(n)}\right) / (n - m + 1) \quad (7)$$

• Max pooling:

$$Q_{j_0}^{(m \rightarrow n)} = \max\left\{P_{j_0}^{(m)}, P_{j_0}^{(m+1)}, \dots, P_{j_0}^{(n)}\right\} \quad (8)$$
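As a concrete reading of Eqs. (7) and (8), the hedged helper below (hypothetical name `temporal_pool`, with element-wise operations assumed) computes $Q_{j_0}^{(m \rightarrow n)}$ from a window of feature maps; with the HTP stride of 2, the window covers two neighboring frames.

```python
import torch

def temporal_pool(maps: torch.Tensor, mode: str = "max") -> torch.Tensor:
    """Pooling function H over a temporal window of feature maps.

    `maps` holds {P_{j0}^{(m)}, ..., P_{j0}^{(n)}} stacked along dim 0,
    i.e. shape (n - m + 1, W, H); returns Q_{j0}^{(m->n)} of shape (W, H).
    """
    if mode == "avg":       # Eq. (7): element-wise mean over the window
        return maps.mean(dim=0)
    elif mode == "max":     # Eq. (8): element-wise maximum over the window
        return maps.max(dim=0).values
    raise ValueError(f"unknown pooling mode: {mode}")

# With the HTP stride of 2, the window contains two neighboring feature maps:
window = torch.randn(2, 28, 28)
q = temporal_pool(window, mode="max")   # shape (28, 28)
```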

3 Experiments and Analysis

3.1 Experimental Settings

Network Architecture Details. Considering the trade-off between accuracy and efficiency, we choose BN-Inception as the backbone network. Following common practice, we set the number of sampled frames to N = 16. In order to obtain only a single video-level feature after the last HTP module (HTP_M), the total number of HTP modules M should be 4 ($M = \log_2 N = 4$). The architecture details are shown in Table 1.

Table 1. HTP-Net architecture details. The network receives an input of size 16 × 224 × 224 to keep a balance between memory capacity and runtime efficiency. Temporal downsampling is performed in each "HTP_x" module. The 2D patch size corresponds to W × H, while the 3D patch size corresponds to T × W × H. The 4D output size corresponds to C × T × W × H.

Layer name | Patch size/stride | Output size
conv1 | 7×7 / 2 | 64×16×112×112
2D max pool1 | 3×3 / 2 | 64×16×56×56
conv2 | 3×3 / 1 | 192×16×56×56
HTP_1 | 2×3×3 / 2×2×2 | 192×8×28×28
inception (3a) | - | 256×8×28×28
inception (3b) | - | 320×8×28×28
inception (3c), stride 2 | - | 576×8×14×14
HTP_2 | 2×1×1 / 2×1×1 | 576×4×14×14
inception (4a) | - | 576×4×14×14
inception (4b) | - | 576×4×14×14
inception (4c) | - | 608×4×14×14
inception (4d) | - | 608×4×14×14
inception (4e), stride 2 | - | 1056×4×7×7
HTP_3 | 2×1×1 / 2×1×1 | 1056×2×7×7
inception (5a) | - | 1024×2×7×7
inception (5b) | - | 1024×2×7×7
HTP_4 | 2×1×1 / 2×1×1 | 1024×1×7×7
2D avg pool, dropout, "#class"-d fc, softmax | - | -

Note that the patch size and stride of the "HTP_1" module differ from those of the other "HTP_x" modules: since spatial downsampling is performed at this point in the original 2D BN-Inception network, the spatial and temporal downsampling need to be combined.

Datasets. We evaluate the performance of HTP-Net on two widely used action recognition benchmarks: UCF101 [17] and HMDB51 [18]. The UCF101 dataset includes 13,320 video clips with 101 action classes. The video sequences in the HMDB51 dataset are extracted from various sources, including movies and online videos; this dataset contains 6,766 videos with 51 actions. In our experiments, we follow the official evaluation scheme: the three standard training and testing splits are evaluated separately and the mean average accuracy over these three splits is reported as the final result.

Implementation Details. 16 frames are randomly selected, one from each equally divided subsection; this sampling strategy allows the whole video to be processed at reasonable computational cost and brings appearance diversity due to the random selection scheme. We use mini-batch SGD and apply dropout to each fully connected layer to train our HTP-Net. The learning rate is initialized to 0.001 and reduced by a factor of 10 when the validation error saturates. The HTP-Net is trained with a batch size of 32, a momentum of 0.9 and a dropout ratio of 0.8. Data augmentation techniques introduced in [2, 5] are applied to produce appearance diversity and to prevent severe over-fitting. Specifically, the size of the input frames is fixed at 340 × 256, then we employ scale jittering with horizontal flipping and corner cropping. The cropped regions are resized to 224 × 224 before being fed into the network.
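A hedged PyTorch sketch of these optimization and augmentation settings is shown below; the helper name `build_training` is hypothetical, the model, data pipeline, dropout layers and batch size of 32 are assumed to be set up elsewhere, and the scale-jittering/corner-cropping scheme of [5] is only approximated here by a random resized crop.

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import transforms

def build_training(model: nn.Module):
    """Hypothetical helper bundling the optimization settings listed above."""
    optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    # Learning rate "reduced by a factor of 10 when the validation error saturates":
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1)
    criterion = nn.CrossEntropyLoss()
    return optimizer, scheduler, criterion

# Augmentation sketch: frames fixed at 340x256; scale jittering with corner
# cropping is approximated by RandomResizedCrop, followed by horizontal
# flipping, with crops of 224x224 fed into the network.
train_transform = transforms.Compose([
    transforms.Resize((256, 340)),
    transforms.RandomResizedCrop(224, scale=(0.6, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```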

3.2 Benchmark Comparison

After the detailed elaboration of the HTP-Net architecture and the experimental settings, final benchmark experiments are conducted on the UCF101 and HMDB51 datasets over the three standard splits to further evaluate the performance of our proposed HTP-Net. Three setups are considered: (1) only RGB images as input; (2) only stacked optical flow images as input; (3) a two-stream fusion strategy using RGB and optical flow images simultaneously. These lead to three networks, denoted HTP-Net (RGB), HTP-Net (Optical Flow) and HTP-Net (Two-stream) respectively for clear and concise presentation. The accuracy results on each testing split are summarized in Table 2. As shown in the last row of Table 2, for the UCF101 dataset the average accuracies of HTP-Net (RGB), HTP-Net (Optical Flow) and HTP-Net (Two-stream) are 90.2%, 93.0% and 96.2%, respectively. For the HMDB51 dataset, the average accuracies are 62.9%, 74.7% and 77.6%, respectively. Clearly, HTP-Net (Two-stream) outperforms the other two networks, and optical flow information does help in improving action recognition accuracy.

In the following, we compare the average accuracy of HTP-Net (Two-stream) with several state-of-the-art methods on the UCF101 and HMDB51 benchmarks. The comparison methods include traditional methods [8], baseline networks [5–7, 11] and recent mainstream approaches [14–16, 19, 20]. The results are reported in Table 3.

As shown in Table 3, HTP-Net (Two-stream) obtains superior results, outperforming the previous best approach by 0.4% on UCF101 and 2.8% on HMDB51.

3.3 Efficiency Comparison

Without doubt, calculating optical flow is time-consuming, so training a two-stream ConvNet with stacked optical flow images as input requires more computation. Hence, for real-time action recognition, using optical flow is not a good choice. Following common practice for the action recognition task, the efficiency comparison is conducted using raw RGB input only; in this subsection we therefore only evaluate the efficiency of our HTP-Net (RGB). All experiments are run on an NVIDIA Titan X GPU.

Table 2. The accuracy performance on UCF101 and HMDB51.

# | UCF101 accuracy (%): RGB | Optical flow | Two-stream | HMDB51 accuracy (%): RGB | Optical flow | Two-stream
Split 1 | 90.0 | 91.5 | 95.7 | 63.9 | 74.9 | 79.2
Split 2 | 90.8 | 93.7 | 96.8 | 62.3 | 73.9 | 76.0
Split 3 | 89.7 | 93.7 | 96.0 | 62.6 | 75.4 | 77.5
Average | 90.2 | 93.0 | 96.2 | 62.9 | 74.7 | 77.6

Table 3. Accuracy comparison with state-of-the-art methods.

Method | Backbone network | UCF101 (%) | HMDB51 (%)
iDT [8] | - | 85.9 | 57.2
Two-stream [6] | VGG-M | 88.0 | 59.4
TDD [7] | VGG-M | 90.3 | 63.2
C3D [11] | ResNet-18 | 85.2 | -
TSN [5] | BN-Inception | 94.2 | 70.7
DOVF [15] | BN-Inception | 94.9 | 71.7
ActionVLAD [19] | VGG-16 | 92.7 | 66.9
TLE [20] | BN-Inception | 95.6 | 71.1
ECOEn-RGB [14] | BN-Inception | 94.8 | 72.4
DTPP [16] | BN-Inception | 95.8 | 74.8
HTP-Net (two-stream) | BN-Inception | 96.2* | 77.6*

* indicates the best results.

Three evaluation metrics are used: speed, model size and accuracy. Two speed measurements are reported: videos per second (vps) and frames per second (fps). The results are summarized in Table 4; for visualization purposes, a bubble chart is displayed in Fig. 3.

From Table 4, in terms of running speed, it is encouraging to see that our HTP-Net (RGB) outperforms TSN (a 2D CNN), Res3D (a 3D CNN) and ECO (a combined 2D-3D CNN) by 29.4 vps, 40.9 vps and 6.7 vps, respectively. These results indirectly illustrate the ability of our HTP-Net (RGB) to efficiently encode the spatio-temporal information of videos.

Table 4. Efficiency comparison with five state-of-the-art methods on an NVIDIA Titan X GPU on the UCF101 and HMDB51 datasets (only using RGB images as input). Note that I/O time is not included in the reported speed.

Method | Speed (vps/fps) | Model size (MB) | UCF101 (%) | HMDB51 (%)
Res3D [12] | 1.1 / - | 144 | 85.8 | 54.9
ARTNet [21] | 1.8 / - | 151 | 93.5* | 67.6
TSN [5] | 12.6 / - | 39.7* | 87.7 | 51.0
ECO16F [14] | 24.5 / 392.0 | >128 | 92.8 | 68.5*
ECOLite-16F [14] | 35.3 / 564.8 | 128 | 91.6 | 68.2
HTP-Net (RGB) | 42* / 672* | 39.7* | 90.2 | 62.9

* indicates the best results.

Fig. 3. Efficiency comparison on UCF101 (over three splits) for HTP-Net (RGB) and other state-of-the-art methods. The bubble size is positively correlated with the model size. Our HTP-Net (RGB) (red bubble) offers a trade-off among the three evaluation metrics: speed, model size and accuracy. (Color figure online)

Besides, as expected, the model size of our HTP-Net (RGB) is comparable with that of TSN and much smaller than those of the other methods. Specifically, our HTP-Net (RGB) occupies only 39.7 MB of storage, whereas the other methods (except TSN) reach up to 151 MB, which is 3-4 times larger. It is clear that our HTP-Net (RGB) benefits from the fewer parameters of 2D ConvNets and from the ability of 3D pooling to model spatio-temporal information effectively. However, from the last two columns of Table 4, we can see that the action recognition accuracy of our HTP-Net (RGB) is higher than that of Res3D and TSN but lower than that of ARTNet and ECO, which utilize more complex networks. Moreover, from the results shown in Table 2, it can be concluded that the optical flow modality is still able to provide supplementary information for action recognition. In the future, we intend to further improve our HTP-Net (RGB) to narrow the accuracy gap between single-stream and two-stream inputs.

As shown in Fig. 3, the small red bubble in the upper right corner represents HTP-Net (RGB), which clearly shows that our HTP-Net (RGB) is a computationally efficient, lightweight model with competitive action recognition accuracy.

4 Conclusion

In this paper, the Hierarchical Temporal Pooling (HTP) module is proposed, a lightweight module for merging temporal motion and spatial appearance information simultaneously. Within the two-stream ConvNets framework, an efficient action recognition network termed HTP-Net is developed, which is able to obtain effective video-level representations. As demonstrated on the UCF101 and HMDB51 datasets, our HTP-Net (Two-stream) brings the state-of-the-art results to a new level, and HTP-Net (RGB) processes videos much faster with a smaller model size. Specifically, HTP-Net (RGB) runs at 42 videos per second (vps) and 672 frames per second (fps) on an NVIDIA Titan X GPU with competitive action recognition accuracy, which enables real-time action recognition and is of great value in practical applications.

In the future, we will work on improving the action recognition accuracy of our HTP-Net (RGB) while maintaining its outstanding properties in terms of small model size and computational efficiency.

Acknowledgment. This paper was partially supported by the Shenzhen Science & Technology Fundamental Research Program (No. JCYJ20160330095814461) and the Shenzhen Key Laboratory for Intelligent Multimedia and Virtual Reality (ZDSYS201703031405467). Special acknowledgements are given to the Aoto-PKUSZ Joint Research Center of Artificial Intelligence on Scene Cognition & Technology Innovation for its support.

References

1. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)

2. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)

3. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

4. Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)

5. Wang, L., et al.: Temporal segment networks: towards good practices for deep action recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 20–36. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_2

6. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems, pp. 568–576 (2014)

7. Wang, L., Qiao, Y., Tang, X.: Action recognition with trajectory-pooled deep-convolutional descriptors. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4305–4314 (2015)

8. Wang, H., Schmid, C.: Action recognition with improved trajectories. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3551–3558 (2013)

9. Baccouche, M., Mamalet, F., Wolf, C., Garcia, C., Baskurt, A.: Sequential deep learning for human action recognition. In: Salah, A.A., Lepri, B. (eds.) HBU 2011. LNCS, vol. 7065, pp. 29–39. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-25446-8_4

10. Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35, 221–231 (2013)

11. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: 2015 IEEE International Conference on Computer Vision (ICCV), pp. 4489–4497. IEEE (2015)

12. Tran, D., Ray, J., Shou, Z., Chang, S.-F., Paluri, M.: ConvNet architecture search for spatiotemporal feature learning. arXiv preprint arXiv:1708.05038 (2017)

13. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)

14. Zolfaghari, M., Singh, K., Brox, T.: ECO: efficient convolutional network for online video understanding. arXiv preprint arXiv:1804.09066 (2018)

15. Lan, Z., Zhu, Y., Hauptmann, A.G., Newsam, S.: Deep local video feature for action recognition. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1219–1225. IEEE (2017)

16. Zhu, J., Zou, W., Zhu, Z.: End-to-end video-level representation learning for action recognition. arXiv preprint arXiv:1711.04161 (2017)

17. Soomro, K., Zamir, A.R., Shah, M.: UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)

18. Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: HMDB: a large video database for human motion recognition. In: 2011 IEEE International Conference on Computer Vision (ICCV), pp. 2556–2563. IEEE (2011)

19. Girdhar, R., Ramanan, D., Gupta, A., Sivic, J., Russell, B.: ActionVLAD: learning spatio-temporal aggregation for action classification. In: CVPR, p. 3 (2017)

20. Diba, A., Sharma, V., Van Gool, L.: Deep temporal linear encoding networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)

21. Wang, L., Li, W., Li, W., Van Gool, L.: Appearance-and-relation networks for video classification. arXiv preprint arXiv:1711.09125 (2017)
