GPU Accelerated Sequence Learning for Action Recognition

Yemin Shi

[email protected]

2018-03


Background

Object Recognition (Image Classification) vs. Action Recognition (Video Classification)

Action recognition vs. object recognition: the temporal domain, long-term dependence, and high computational complexity.

General methods are not good enough for action recognition.

Existing methods are still far from practical use.

Research Trends

Datasets      Year        Actions   Videos   Annotations   Source          Localization
HMDB51        2011        51        7K       7K            YouTube/Movie   No
UCF101        2012        101       13K      13K           YouTube         No
Sports 1M     2014        487       1.1M     1.1M          YouTube         No
THUMOS 15     2014        101       24K      21K           YouTube         Yes
ActivityNet   2015        200       20K      23K           YouTube         Yes
Charades      2016/2017   157       10K      67K           267 Homes       Yes
AVA           2017        80        214      197K          Movie           Yes
Kinetics      2017        400       305K     305K          YouTube         No
MIT           2017        339       1M       1M            10 sources      No
SLAC          2017        200       520K     1.75M         YouTube         Yes

Action Recognition

Modeling the temporal domain is one of the most important targets of action recognition.

Shortcomings of existing methods: actions have long duration, which leads to high complexity; LSTM is not good enough.

Therefore, we need a more efficient sequence learning model to improve the ability to model temporal information.

Overview

Hand-crafted Features and Deep Features: Deep Trajectory Descriptor

Importance of Each Frame: Temporal Attentive Network

The Ability of Modeling the Temporal Domain: shuttleNet

One-shot Action Recognition: Hierarchical Temporal Memory Enhanced One-shot Distance Learning

Open-set Action Recognition: Open Deep Network

Overview

Sequence learning for action recognition:

Deep Trajectory Descriptor

Temporal Attentive Network

shuttleNet

Hierarchical Temporal Memory Enhanced One-shot Distance Learning

Open Deep Network


Deep Trajectory Descriptor

Problems and Solutions: Hand-crafted features can hardly describe the movement process, while CNNs are good at describing structure. Integrate hand-crafted features and CNNs to improve performance.

Hand-crafted features: more statistics, less structure.

CNN: structure is important.

Deep Trajectory Descriptor

Improve Dense Trajectory with Background Subtraction: only extract trajectories and optical flow on the foreground.

[Figure: Videos, Masks, Foreground]

where $S_{fore}$ is the sum over the foreground square area and $(i, j)$ indexes positions around the square area.
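As a rough, hypothetical illustration of the foreground restriction (not the thesis code), the sketch below uses OpenCV: a background subtractor produces a mask, and optical flow is kept only on foreground pixels. The file name, flow method, and parameters are assumptions.

```python
import cv2

# Hypothetical sketch: keep optical flow only on the foreground mask.
cap = cv2.VideoCapture("video.avi")             # assumed input video
bg_sub = cv2.createBackgroundSubtractorMOG2()   # background subtraction

ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    mask = bg_sub.apply(frame)                  # foreground mask (0 = background)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    flow[mask == 0] = 0                         # discard flow on the background
    prev_gray = gray
```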

Deep Trajectory Descriptor

Main Idea: Trajectory Texture Image, i.e., project trajectories onto a canvas; a CNN is then employed for structural feature learning (see the sketch below).

[Pipeline: input video → dense trajectories → projection into 2D space over an adaptive duration → Trajectory Texture Image → CNN (Conv, Pooling, LRN, ..., Conv, FC)]
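A minimal NumPy sketch of the projection step named above: each dense trajectory (a short sequence of (x, y) points) is drawn onto a blank canvas to form a Trajectory Texture Image for the CNN. The canvas size, normalization, and brightness rule are assumptions for illustration, not the exact scheme of the thesis.

```python
import numpy as np

def trajectory_texture_image(trajectories, height=224, width=224):
    """Project a list of trajectories onto a single 2D canvas.

    trajectories: iterable of (L, 2) arrays of (x, y) points,
                  assumed to already be normalized to [0, 1].
    """
    canvas = np.zeros((height, width), dtype=np.float32)
    for traj in trajectories:
        pts = np.asarray(traj, dtype=np.float32)
        xs = np.clip((pts[:, 0] * (width - 1)).astype(int), 0, width - 1)
        ys = np.clip((pts[:, 1] * (height - 1)).astype(int), 0, height - 1)
        # later points are drawn brighter, so the image encodes motion direction
        for t, (x, y) in enumerate(zip(xs, ys)):
            canvas[y, x] = max(canvas[y, x], (t + 1) / len(pts))
    return canvas
```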


Deep Trajectory Descriptor

Improve trajectory projection method

Deep Trajectory Descriptor

DTD with LSTM: Treat each Trajectory Texture Image as the input at one time step; an LSTM is used to model the temporal domain. This improves the ability of DTD to model complex actions.

Our LSTM Model

$x_t$ is the input at time t.

$h_t$ is the hidden state at time t.

$i_t$, $f_t$, $c_t$, $o_t$ are the input gate, forget gate, memory cell, and output gate at time t.

Learn a long-term action description:
CNN for DTD feature learning;
Sequential DTD for long-term action representation;
RNN (LSTM) for temporal domain modeling.
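The slide lists the LSTM gates but not their equations; for reference, the standard LSTM cell (which the model above is assumed to follow) computes

$$
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i), \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f), \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c), \\
h_t &= o_t \odot \tanh(c_t).
\end{aligned}
$$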

Deep Trajectory Descriptor

[Figure: example frames from the action class ApplyEyeMakeup]

Loss function: Softmax loss with a weight regularizer,

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k}\mathbf{1}\{y^{(i)}=j\}\log\frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}}} \;+\; \frac{\lambda}{2}\sum_{i=1}^{k}\sum_{j=0}^{n}\theta_{ij}^2$$
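A minimal NumPy sketch of this loss (softmax cross-entropy plus the L2 weight regularizer), only to make the formula concrete; variable names and shapes are assumptions.

```python
import numpy as np

def softmax_loss(theta, X, y, lam):
    """Softmax loss with L2 weight regularizer.

    theta: (k, n) weight matrix, X: (m, n) features, y: (m,) labels in [0, k).
    """
    m = X.shape[0]
    logits = X @ theta.T                          # (m, k) scores theta_j^T x^(i)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    data_loss = -np.log(probs[np.arange(m), y]).mean()
    reg_loss = 0.5 * lam * np.sum(theta ** 2)     # (lambda/2) * sum theta_ij^2
    return data_loss + reg_loss
```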

Deep Trajectory Descriptor

Three-stream Framework

Deep Trajectory Descriptor

Experiment results

Overview

Sequence learning for action recognition:

Deep Trajectory Descriptor

Temporal Attentive Network

shuttleNet

Hierarchical Temporal Memory Enhanced One-shot Distance Learning

Open Deep Network

Temporal Attentive Network

Problems and solutions: Not all postures contribute equally to the successful recognition of an action.

Texture and motion are not independent of each other.

The most important frames for RGB and for optical flow may not correspond (they may not share the same frame id).

Temporal Attentive Network

Attention mechanism

Temporal Attentive Network

$$e_{ij} = v^T \tanh(W_1' h_i + W_2' g_j), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T}\exp(e_{kj})}, \qquad o_j^h = \sum_{i=1}^{T}\alpha_{ij}\, h_i$$

$$f_{ji} = u^T \tanh(W_3' g_i + W_4' h_j), \qquad \beta_{ji} = \frac{\exp(f_{ji})}{\sum_{k=1}^{T}\exp(f_{jk})}, \qquad o_i^g = \sum_{j=1}^{T}\beta_{ji}\, g_j$$

The attention scores give a weight for each input ($\alpha$, $\beta$), and the outputs $o$ are the weighted sums over all inputs, applied across the spatial and temporal domains.
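The equations above describe two attention passes between the hidden states h of one stream and g of the other. Below is a minimal NumPy sketch of one direction (weighting the states h_i for each g_j); all shapes and parameter names are assumptions for illustration.

```python
import numpy as np

def cross_attention(h, g, W1, W2, v):
    """Attend over the T states h_i for each state g_j.

    h, g: (T, d) hidden states of the two streams.
    W1, W2: (d, d) projections, v: (d,) scoring vector.
    Returns o: (T, d) with o[j] = sum_i alpha_ij * h[i].
    """
    # e[i, j] = v^T tanh(W1' h_i + W2' g_j)
    e = np.einsum('d,ijd->ij', v,
                  np.tanh((h @ W1)[:, None, :] + (g @ W2)[None, :, :]))
    # alpha[:, j] = softmax over i of e[:, j]
    alpha = np.exp(e - e.max(axis=0, keepdims=True))
    alpha /= alpha.sum(axis=0, keepdims=True)
    # o[j] = weighted sum of all h_i
    return alpha.T @ h
```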

Temporal Attentive Network

Experiment results

Overview

Sequence learning for action recognition:

Deep Trajectory Descriptor

Temporal Attentive Network

shuttleNet

Hierarchical Temporal Memory Enhanced One-shot Distance Learning

Open Deep Network

shuttleNet

Problems and solutions: Most deep neural networks are built with only feed-forward connections, and existing RNNs are still not good enough in practice.

[Siegelbaum'00] Siegelbaum S A, Hudspeth A J. Principles of neural science[M]. New York: McGraw-Hill, 2000.

[Figure: Visual cortical pathways [Siegelbaum'00]; blue arrows denote feed-forward connections and red arrows feed-back connections among areas V1, V2, V4, TEO, and IT.]


shuttleNet

Loop Connections

Input Projection

Output Selection

shuttleNet

Experiment results

Comparing with existing RNNs

Comparing with other action recognition methods

Overview

Sequence learning for action recognition:

Deep Trajectory Descriptor

Temporal Attentive Network

shuttleNet

Hierarchical Temporal Memory Enhanced One-shot Distance Learning

Open Deep Network

Motivation

Videos are complicated because of temporal complexity and variation. Distance learning can decrease intra-class distance while increasing inter-class distance.

Method: Triplet loss

Not all frames contribute equally to recognition. The harder a frame is to predict, the more representative it is.

Method: Hierarchical Temporal Memory (HTM)

Hawkins, Jeff (2004). On Intelligence (1st ed.). Times Books. p. 272. ISBN 0805074562.

Framework

Seen-class Stage

Matching Network training: Sample a target video and a support-set video from the seen classes, and maximize the probability of the class that the target video belongs to.

HTM training: Make the HTM accustomed to seen-class videos.

Unseen-class Stage

Triplet loss (a sketch of the standard form follows).
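A minimal NumPy sketch of the standard margin-based triplet loss on video embeddings; the exact formulation used in the thesis may differ.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss on a batch of embeddings (batch, dim).

    Pulls same-class pairs together and pushes different-class pairs
    apart by at least `margin`.
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=1)  # intra-class distance
    d_neg = np.sum((anchor - negative) ** 2, axis=1)  # inter-class distance
    return np.mean(np.maximum(d_pos - d_neg + margin, 0.0))
```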

Experiments

Overview

Sequence learning for action recognition:

Deep Trajectory Descriptor

Temporal Attentive Network

shuttleNet

Hierarchical Temporal Memory Enhanced One-shot Distance Learning

Open Deep Network

Open Deep Network

Motivation: Action recognition in the real world is essentially an open-set problem. It is impossible to know all action categories beforehand, and infeasible to prepare sufficient training samples for emerging categories. Most recognition systems are designed for a static closed world, under the primary assumption that all categories are known a priori.

[Figure: in the closed-set setting, only known categories appear in both training and testing; in the open-set setting, training uses known categories while testing contains both known and unknown categories.]

Open Deep Network

Multi-class unknown category detection: the multi-class triplet thresholding method.

Consider the inter-class relation for unknown category detection: accept the knowns and reject the unknowns.

Train a triplet threshold $[\eta_i, \mu_i, \delta_i]$ per category, and apply the triplet threshold to each sample during the detection process.

Define $[\eta_i, \mu_i, \delta_i]$ as follows:

Accept threshold: $\eta_i = \alpha \cdot \frac{1}{X}\sum_{j=1}^{X} f_{i,j}$

Reject threshold: $\mu_i = \beta \cdot \eta_i$

Distance threshold: $\delta_i = \sigma \cdot \frac{1}{X}\sum_{j=1}^{X}\left(f_{i,j} - s_{i,j}\right)$

where $f_{i,j}$ is the maximal score of the i-th category and $s_{i,j}$ is the second maximal score of the i-th category.
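A small sketch of how the three thresholds could be estimated from the training scores of one known category, following the formulas above; the values of alpha, beta, sigma and the score layout are assumptions.

```python
import numpy as np

def triplet_threshold(scores, alpha=0.9, beta=0.5, sigma=0.5):
    """Estimate (eta, mu, delta) for one known category.

    scores: (X, C) confidence scores of the X training samples of this
            category over the C known classes.
    """
    top2 = np.sort(scores, axis=1)[:, -2:]   # per-sample [2nd-max, max]
    f = top2[:, 1]                           # f_{i,j}: maximal score
    s = top2[:, 0]                           # s_{i,j}: second maximal score
    eta = alpha * f.mean()                   # accept threshold
    mu = beta * eta                          # reject threshold
    delta = sigma * (f - s).mean()           # distance threshold
    return eta, mu, delta
```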

Updating the deep network: reconstruct the classification layer.

$$w_{N+1} = w_{N+1}' + w_{N+1}'' = \frac{\sum_{n=1}^{N} w_n}{N} + \frac{\sum_{m=1}^{M} w_m}{M}$$

Transfer knowledge from the trained categories: take the mean of the known categories' weights as part of the new weights, so that the new category obeys the same distribution as the known categories.
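A sketch of the weight-column initialization given by the formula above: the new column is the mean of all N known columns plus the mean of the M most similar categories' columns. The function name and how the similar categories are chosen are assumptions.

```python
import numpy as np

def init_new_category_column(W, similar_idx):
    """Initialize the classifier weight column for a new category.

    W: (d, N) weight matrix of the N known categories.
    similar_idx: indices of the M known categories most similar to the new one.
    """
    w_mean_all = W.mean(axis=1)                       # (1/N) * sum_n w_n
    w_mean_similar = W[:, similar_idx].mean(axis=1)   # (1/M) * sum_m w_m
    w_new = w_mean_all + w_mean_similar               # w_{N+1} = w' + w''
    return np.concatenate([W, w_new[:, None]], axis=1)  # (d, N+1)
```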

Open Deep Network

[Figure: the weight matrix is updated by appending a new weight column for the new category.]


Open Deep Network

[Figure: the most similar known categories contribute additional weight to the new category's column.]

Transfer knowledge from the similar categories: the similar categories should play a more critical role in the initialization.

Incremental training

Balanced training strategy: guarantee that each known category has the same number of samples as the new category during fine-tuning, to reduce jitter of the model.

Allometry training strategy: adopt a learning-rate decay matrix that differs between known and new categories, forcing the new categories to learn much faster than the known categories during fine-tuning.

Allometry training factor:

$$\alpha_i = \begin{cases} 0.1, & i \le N \\ 1, & i > N \end{cases}$$

Updating the weights:

$$W_{i,j} = W_{i,j} - \alpha_i \lambda \frac{\partial}{\partial W_{i,j}} J(W, b), \qquad b_i = b_i - \alpha_i \lambda \frac{\partial}{\partial b_i} J(W, b)$$
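A sketch of the allometry-scaled update: each category (output row) gets its own factor, 0.1 for the N known categories and 1 for the new ones, so new categories learn faster during fine-tuning. Gradients and shapes are assumed for illustration.

```python
import numpy as np

def allometry_update(W, b, dW, db, n_known, lr=0.01):
    """Per-category SGD step with the allometry factor alpha_i.

    W, dW: (n_classes, d) weights and gradients, rows = categories.
    b, db: (n_classes,) biases and gradients.
    """
    n_classes = W.shape[0]
    alpha = np.where(np.arange(n_classes) < n_known, 0.1, 1.0)
    W -= (alpha[:, None] * lr) * dW   # W_ij -= alpha_i * lambda * dJ/dW_ij
    b -= (alpha * lr) * db            # b_i  -= alpha_i * lambda * dJ/db_i
    return W, b
```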


Open Deep Network

Comparing Our Initialization and Stochastic Initialization

[Figure panels: ① the accuracies under different openness; ② testing accuracies at each iteration; ③④ the accuracies of known and unknown categories at each iteration]

Open Deep Network

Experiment results: ODN vs. closed-set action recognition. ODN uses 5.39 samples vs. 94.4 samples, with comparable performance.

Current Situation of Action Recognition

High computation cost of optical flow;

High computation cost of deep learning models; 3D ConvNets are much faster, but still very heavy.

Real-time capability of optical flows

Optical flow                   TVL1     FB       S2D      PCA      DIS      Flownet2    Brox     LK
Accuracy on UCF101 split 1     72.98    70.58    65.34    69.6     63.28    57.71       72.35    37.51
Per-frame time (340x256, s)    0.181    0.056    0.026    0.053    0.07     0.123(+-)   0.074    0.0034
fps                            5.5      17.9     38.5     18.9     14.3     8.1         13.5     294.1

Speed and accuracy of several optical flow methods

Trend: Abandon optical flow

3D CNN

Year   Model      Method                                                                       Input dimension (complexity)
2012   3D CNN     Replace 2D conv with 3D conv                                                 7x33x60x40
2015   C3D        Deep 3D ConvNet                                                              16x3x112x112
2016   ResNet3D   3D ResNet50                                                                  80x6x80x80
2017   I3D        Inflate 2D kernels to 3D kernels and copy weights along the 3rd dimension    64x3x224x224
2017   NL I3D     Non-local operation                                                          32x8x224x224
2017   P3D        Simulate 3x3x3 kernels with 1x3x3 and 3x1x1 kernels                          16x3x160x160

I3D has 107.9B FLOPs

Google uses 64 GPUs to train the I3D model.

GPU selection?

I3D experiments based on TensorFlow. The train/test code is available at https://github.com/shiyemin/shuttleNet.

Input size: 64x3x224x224.

Batch Norm needs a large enough batch size, which consumes a lot of GPU memory.

107.9B FLOPs: sufficient computational power is needed to reduce training time.

GPU selection?

I3D experiments based on TensorFlow. Note: we do not use the NVLink connections between GPUs. Because no further multi-GPU optimization was applied, our results may be slower than others'.

GPU     #GPU   Max Batch Size   Train FPS   Test FPS
K40     2      16               149.87      449.1
K80*    2      16               148.65      411.3
K80*    4      32               278.61      808.2
K80*    8      64               562.46      1649.4
K80*    16     112              1046.24     3298.8
P100    2      22               622.33      1692.5
P100    4      44               1212.01     3338

*Here, one K80 is one core of a K80 card (each K80 card has two cores).

References

Wang L, Xiong Y, Wang Z, et al. Temporal segment networks: towards good practices for deep action recognition[C]//European Conference on Computer Vision. Springer International Publishing, 2016: 20-36.

Donahue J, Anne Hendricks L, Guadarrama S, et al. Long-term recurrent convolutional networks for visual recognition and description[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 2625-2634.

Yue-Hei Ng J, Hausknecht M, Vijayanarasimhan S, et al. Beyond short snippets: Deep networks for video classification[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 4694-4702.

Wu Z, Wang X, Jiang Y G, et al. Modeling spatial-temporal clues in a hybrid deep learning framework for video classification[C]//Proceedings of the 23rd ACM International Conference on Multimedia. ACM, 2015: 461-470.

Sharma S, Kiros R, Salakhutdinov R. Action recognition using visual attention[J]. arXiv preprint arXiv:1511.04119, 2015.

Schuldt C, Laptev I, Caputo B. Recognizing human actions: a local SVM approach[C]//Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004). IEEE, 2004, 3: 32-36.

Reddy K K, Shah M. Recognizing 50 human action categories of web videos[J]. Machine Vision and Applications, 2013, 24(5): 971-981.

Siegelbaum S A, Hudspeth A J. Principles of neural science[M]. New York: McGraw-Hill, 2000.

Wang L M, Qiao Y, Tang X. Motionlets: Mid-level 3D parts for human motion recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2013: 2674-2681.

Wang H, Kläser A, Schmid C, et al. Dense trajectories and motion boundary descriptors for action recognition[J]. International Journal of Computer Vision, 2013, 103(1): 60-79.

Cai Z, Wang L, Peng X, et al. Multi-view super vector for action recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014: 596-603.

Peng X, Wang L, Wang X, et al. Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice[J]. Computer Vision and Image Understanding, 2016.

Liu L, Shao L, Rockett P. Boosted key-frame selection and correlated pyramidal motion-feature representation for human action recognition[J]. Pattern Recognition, 2013, 46(7): 1810-1818.

Wang L, Qiao Y, Tang X. Action recognition with trajectory-pooled deep-convolutional descriptors[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 4305-4314.

Wang L, Qiao Y, Tang X. MoFAP: A multi-level representation for action recognition[J]. International Journal of Computer Vision, 2015: 1-18.

Varol G, Laptev I, Schmid C. Long-term temporal convolutions for action recognition[J]. arXiv preprint arXiv:1604.04494, 2016.

Wang L, Xiong Y, Wang Z, et al. Towards good practices for very deep two-stream convnets[J]. arXiv preprint arXiv:1507.02159, 2015.

Thanks!