GPU Accelerated Sequence Learning for Action...
Background
Object Recognition (Image Classification)
Action Recognition (Video Classification)
Action Recognition vs. Object Recognition: temporal domain, long-term dependence, high computational complexity.
General-purpose methods are not good enough for action recognition.
Existing methods are still far from practical use.
Research Trends
Dataset      Year       Actions  Videos  Annotations  Source         Localization
HMDB51       2011       51       7K      7K           YouTube/Movie  No
UCF101       2012       101      13K     13K          YouTube        No
Sports 1M    2014       487      1.1M    1.1M         YouTube        No
THUMOS 15    2014       101      24K     21K          YouTube        Yes
ActivityNet  2015       200      20K     23K          YouTube        Yes
Charades     2016/2017  157      10K     67K          267 Homes      Yes
AVA          2017       80       214     197K         Movie          Yes
Kinetics     2017       400      305K    305K         YouTube        No
MIT          2017       339      1M      1M           10 sources     No
SLAC         2017       200      520K    1.75M        YouTube        Yes
Action Recognition
Modeling the temporal domain is one of the most important goals of action recognition.
Shortcomings of existing methods:
Actions have long durations: high complexity.
LSTM is not good enough.
Therefore, we need a more efficient sequence learning model to improve the ability to model temporal information.
Overview
Hand-crafted features and deep features: Deep Trajectory Descriptor
Importance of each frame: Temporal Attentive Network
The ability to model the temporal domain: shuttleNet
One-shot action recognition: Hierarchical Temporal Memory Enhanced One-shot Distance Learning
Open-set action recognition: Open Deep Network
Overview
Sequence learning for action recognition
Deep Trajectory Descriptor
Temporal Attentive Network
shuttleNet
Hierarchical Temporal Memory Enhanced One-shot Distance Learning
Open Deep Network
Deep Trajectory Descriptor
Problems and Solutions
Hand-crafted features can hardly describe the movement process; CNNs are good at describing structure.
Integrate hand-crafted features with a CNN to improve performance.
Hand-crafted features: more statistics, less structure.
CNN: structure is important.
Deep Trajectory Descriptor
Improve Dense Trajectory with Background Subtraction
Only extract trajectories and optical flow on the foreground.
[Figure: videos, masks, foreground]
where $S_{fore}$ is the sum of the foreground square area and $(i, j)$ indexes positions around the square area.
Deep Trajectory Descriptor
Main Idea
Trajectory Texture Image: project trajectories onto a canvas.
A CNN is employed for structural feature learning.
[Figure: input video -> dense trajectories -> projection into 2D space over an adaptive duration -> Trajectory Texture Image -> CNN (Conv, Pooling, LRN, ..., Conv, FC)]
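The projection step can be illustrated with a small sketch. This is not the authors' implementation: it assumes each trajectory is a list of (x, y) pixel coordinates and that the canvas simply counts how often each cell is visited. The function name `project_trajectories` is hypothetical.

```python
def project_trajectories(trajectories, width, height):
    """Rasterize point trajectories onto a height x width canvas.

    Each trajectory is a list of (x, y) pixel coordinates. Every visited
    cell is incremented, so cells crossed by many trajectories get brighter.
    """
    canvas = [[0 for _ in range(width)] for _ in range(height)]
    for trajectory in trajectories:
        for x, y in trajectory:
            col, row = int(round(x)), int(round(y))
            # Ignore points that fall outside the canvas.
            if 0 <= col < width and 0 <= row < height:
                canvas[row][col] += 1
    return canvas


# Two diagonal trajectories that cross in the centre of a 3x3 canvas.
trajectories = [
    [(0, 0), (1, 1), (2, 2)],
    [(2, 0), (1, 1), (0, 2)],
]
image = project_trajectories(trajectories, width=3, height=3)
```

The resulting 2D grid can then be fed to a CNN like any other single-channel image.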
Deep Trajectory Descriptor
DTD with LSTM
Treat each Trajectory Texture Image as the input of one time step; an LSTM models the temporal domain.
This improves the ability of DTD to model complex actions.
Our LSTM Model
$x_t$ is the input at time $t$.
$h_t$ is the hidden state at time $t$.
$i_t$, $f_t$, $c_t$, $o_t$ are the input gate, forget gate, memory cell and output gate at time $t$.
Learn long-term action description
CNN for DTD feature learning;
Sequential DTD for long-term action representation;
RNN (LSTM) for temporal domain modeling.
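The LSTM equations themselves are not part of this transcript, so the sketch below uses the standard LSTM formulation and may differ from the authors' variant. It is reduced to scalar states for readability; the `params` layout and the weight values are assumptions made only for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One step of a standard scalar LSTM cell.

    params maps each gate name ("i", "f", "o", "g") to a tuple
    (w_x, w_h, b): input weight, recurrent weight and bias.
    """
    def pre(gate):
        w_x, w_h, b = params[gate]
        return w_x * x_t + w_h * h_prev + b

    i_t = sigmoid(pre("i"))          # input gate
    f_t = sigmoid(pre("f"))          # forget gate
    o_t = sigmoid(pre("o"))          # output gate
    g_t = math.tanh(pre("g"))        # candidate memory content
    c_t = f_t * c_prev + i_t * g_t   # memory cell update
    h_t = o_t * math.tanh(c_t)       # hidden state
    return h_t, c_t


# Hypothetical weights, chosen only to make the example run.
params = {gate: (0.5, 0.5, 0.0) for gate in ("i", "f", "o", "g")}
h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.5]:  # one scalar feature per Trajectory Texture Image
    h, c = lstm_step(x, h, c, params)
```

In practice each time step would receive the CNN feature vector of one Trajectory Texture Image rather than a scalar.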
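Deep Trajectory Descriptor
[Figure: example class ApplyEyeMakeup]
Loss function: softmax loss with a weight regularizer
$$J(\theta) = -\frac{1}{m}\left[\sum_{i=1}^{m}\sum_{j=1}^{k} 1\{y^{(i)} = j\}\,\log\frac{e^{\theta_j^{T} x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^{T} x^{(i)}}}\right] + \frac{\lambda}{2}\sum_{i=1}^{k}\sum_{j=0}^{n}\theta_{ij}^{2}$$
The first term is the softmax loss; the second term is the weight regularizer.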
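A direct, unoptimized evaluation of the regularized softmax loss can look as follows. This is a sketch under stated assumptions: labels are 0-based here (the slide counts classes from 1), each input vector carries a leading 1.0 so that index 0 acts as the bias term, and the function name `softmax_loss` is illustrative.

```python
import math


def softmax_loss(theta, xs, ys, lam):
    """Regularized softmax loss J(theta).

    theta: k rows of n + 1 weights (index 0 is the bias term)
    xs:    m feature vectors of length n + 1 (first entry fixed to 1.0)
    ys:    m class labels in 0 .. k - 1
    lam:   weight of the L2 regularizer
    """
    m = len(xs)
    data_term = 0.0
    for x, y in zip(xs, ys):
        scores = [sum(w * v for w, v in zip(row, x)) for row in theta]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        data_term += scores[y] - log_norm  # log of the softmax probability
    reg_term = sum(w * w for row in theta for w in row)
    return -data_term / m + 0.5 * lam * reg_term


# With all-zero weights every class is equally likely, so the loss is log(k).
loss = softmax_loss([[0.0, 0.0]] * 3, [[1.0, 2.0]], [1], lam=0.1)
```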
Temporal Attentive Network
Problems and solutions
Not all postures contribute equally to the successful recognition of an action.
Texture and motion are not independent of each other.
The most important frames for RGB and optical flow may not correspond (they need not share the same frame id).
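Temporal Attentive Network
$$e_{ij} = v^{T}\tanh\left(W_1' h_i + W_2' g_j\right)$$
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T}\exp(e_{kj})}$$
$$o_j^{h} = \sum_{i=1}^{T}\alpha_{ij} h_i$$
$$f_{ji} = u^{T}\tanh\left(W_3' g_i + W_4' h_j\right)$$
$$\beta_{ji} = \frac{\exp(f_{ji})}{\sum_{k=1}^{T}\exp(f_{jk})}$$
$$o_i^{g} = \sum_{j=1}^{T}\beta_{ji} g_j$$
[Figure labels: spatial domain, temporal domain]
$\alpha_{ij}$ and $\beta_{ji}$ are the weights for each input; $o_j^{h}$ and $o_i^{g}$ are the weighted sums over all inputs.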
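The first attention branch (scores $e_{ij}$, weights $\alpha_{ij}$, output $o_j^{h}$) can be sketched in plain Python. The helper names and the toy matrices are assumptions for illustration only; a real model would learn `w1`, `w2` and `v` and use a tensor library.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vec):
    return [dot(row, vec) for row in matrix]


def attend(h, g_j, w1, w2, v):
    """Attend over h_1..h_T using g_j as the query.

    Returns (alpha, o): alpha[i] is the weight of h[i] and o is the
    weighted sum of all h[i].
    """
    scores = []
    for h_i in h:
        pre = [a + b for a, b in zip(matvec(w1, h_i), matvec(w2, g_j))]
        scores.append(dot(v, [math.tanh(p) for p in pre]))  # e_ij
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]  # softmax over the index i
    o = [sum(a * h_i[d] for a, h_i in zip(alpha, h)) for d in range(len(h[0]))]
    return alpha, o


# Toy example with 2-dimensional features and identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha, o_h = attend(h, g_j=[0.0, 0.0], w1=identity, w2=identity, v=[1.0, 1.0])
```

The second branch has the same shape with the roles of $h$ and $g$ swapped and its own parameters.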
shuttleNet
Problems and solutions
Most deep neural networks are built with only feed-forward connections.
Existing RNNs are still not good enough in practice.
[Siegelbaum’00] Siegelbaum S A, Hudspeth A J. Principles of neural science[M]. New York: McGraw-hill, 2000.
[Figure: Visual Cortical Pathways [Siegelbaum’00], showing areas V1, V2, V4, TEO and IT. Blue arrows: feed-forward connections. Red arrows: feed-back connections.]
shuttleNet
Experiment results
Comparison with existing RNNs
Comparison with other action recognition methods
Hierarchical Temporal Memory Enhanced One-shot Distance Learning
Motivation
Videos are complicated because of temporal complexity and variation.
Distance learning can decrease intra-class distance while increasing inter-class distance.
Method: triplet loss
Not all frames contribute equally to recognition.
The harder a frame is to predict, the more representative it is.
Method: Hierarchical Temporal Memory (HTM)
Hawkins, Jeff (2004). On Intelligence (1st ed.). Times Books. p. 272. ISBN 0805074562.
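The triplet loss named above can be sketched as follows. The slides do not give the exact form, so this uses the common hinge formulation on squared Euclidean distances; the margin value and the function names are assumptions.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on squared Euclidean distances.

    The loss is zero once the negative is farther from the anchor than
    the positive by at least the margin.
    """
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


# Same-class video is close, other-class video is far: no penalty.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 0.0])
# Other-class video is closer than the same-class one: penalized.
hard = triplet_loss([0.0, 0.0], [2.0, 0.0], [1.0, 0.0])
```

Minimizing this quantity pulls same-class embeddings together (smaller intra-class distance) and pushes different-class embeddings apart (larger inter-class distance).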
Seen-class Stage
Matching Network training: sample a target video and a support set of videos from seen classes, and maximize the probability of the class that the target video belongs to.
HTM training: make the HTM accustomed to seen-class videos.
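The Matching Network objective can be illustrated with a minimal sketch. It assumes that video embeddings are already computed and non-zero, and it uses softmax attention over cosine similarities, as in the original Matching Networks formulation; how the deck embeds videos is not shown here, and all names are illustrative.

```python
import math


def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def class_probabilities(target, support):
    """Class distribution for a target embedding.

    support is a list of (embedding, label) pairs drawn from seen classes.
    Each support item votes for its label with a softmax attention weight.
    """
    sims = [cosine(target, emb) for emb, _ in support]
    top = max(sims)
    exps = [math.exp(s - top) for s in sims]
    total = sum(exps)
    probs = {}
    for weight, (_, label) in zip(exps, support):
        probs[label] = probs.get(label, 0.0) + weight / total
    return probs


support = [([1.0, 0.0], "run"), ([0.0, 1.0], "jump")]
probs = class_probabilities([1.0, 0.0], support)
# Training would minimize the negative log-probability of the true class.
loss = -math.log(probs["run"])
```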
Open Deep Network
Motivation
Action recognition in the real world is essentially an open-set problem:
It is impossible to know all action categories beforehand;
It is infeasible to prepare sufficient training samples for emerging categories.
Most recognition systems are designed for a static closed world.
Primary assumption: all categories are known a priori.
[Figure: closed set (known categories for both training and testing) versus open set (known categories for training; known and unknown categories for testing)]
Open Deep Network
Multi-class unknown category detection
The multi-class triplet thresholding method:
Consider the inter-class relations for unknown category detection; accept the knowns and reject the unknowns.
Train a triplet threshold $[\eta_i, \mu_i, \delta_i]$ per category.
Apply the triplet threshold to each sample during the detection process.
Define $[\eta_i, \mu_i, \delta_i]$:
Accept threshold: $\eta_i = \text{alpha} \cdot \mathrm{Mean}\left(\sum_{j=1}^{X} f_{i,j}\right)$
Reject threshold: $\mu_i = \text{beta} \cdot \eta_i$
Distance threshold: $\delta_i = \text{sigma} \cdot \mathrm{Mean}\left(\sum_{j=1}^{X} (f_{i,j} - s_{i,j})\right)$
where:
$f_{i,j}$ is the maximal score of the i-th category;
$s_{i,j}$ is the second maximal score of the i-th category.
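A sketch of how the triplet threshold might be computed and applied. Two things here are assumptions rather than statements from the slides: the Mean of the sum is read as an average over the X training samples of the category, and the decision rule in `is_known` (accept above the accept threshold, reject below the reject threshold, otherwise compare the gap between the top two scores with the distance threshold) is one plausible reading.

```python
def triplet_threshold(top_scores, second_scores, alpha, beta, sigma):
    """Return (eta, mu, delta) for one category.

    top_scores[j] and second_scores[j] are the largest and second largest
    scores (f and s in the slide) of the j-th training sample.
    """
    count = len(top_scores)
    eta = alpha * sum(top_scores) / count          # accept threshold
    mu = beta * eta                                # reject threshold
    gaps = [f - s for f, s in zip(top_scores, second_scores)]
    delta = sigma * sum(gaps) / count              # distance threshold
    return eta, mu, delta


def is_known(top, second, eta, mu, delta):
    """Assumed decision rule; the slides do not state it explicitly."""
    if top >= eta:
        return True
    if top < mu:
        return False
    return (top - second) >= delta


# Hypothetical scores and hyper-parameters.
eta, mu, delta = triplet_threshold([0.9, 0.7], [0.1, 0.1],
                                   alpha=1.0, beta=0.5, sigma=0.5)
```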
Updating deep network
Reconstruct the classification layer:
$$w_{N+1} = w'_{N+1} + w''_{N+1} = \frac{1}{N}\sum_{n=1}^{N} w_n + \frac{1}{M}\sum_{m=1}^{M} w_m$$
① Transfer knowledge from the trained categories: calculate the mean of the known categories' weights as part of the new weights, so that the new category obeys the same distribution as the known categories.
[Figure: weight matrix, updated weight matrix, and the new weight column for the new category]
② Transfer knowledge from the similar categories: the similar categories should play a more critical role in the initialization.
[Figure label: similar categories]
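The initialization $w_{N+1} = \frac{1}{N}\sum_n w_n + \frac{1}{M}\sum_m w_m$ can be sketched as below. The function names are illustrative, and the way the M similar categories are selected is not covered; their indices are simply passed in.

```python
def mean_vector(vectors):
    count = len(vectors)
    return [sum(vec[d] for vec in vectors) / count for d in range(len(vectors[0]))]


def init_new_category(known_weights, similar_indices):
    """Initial weight vector w_{N+1} for a newly added category.

    known_weights:   N weight vectors, one per known category
    similar_indices: indices of the M known categories most similar
                     to the new one (how they are chosen is not shown here)
    """
    overall = mean_vector(known_weights)                                # w'
    similar = mean_vector([known_weights[i] for i in similar_indices])  # w''
    return [a + b for a, b in zip(overall, similar)]


# Three known categories with 2-dimensional weights; the first two are similar.
known = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
w_new = init_new_category(known, similar_indices=[0, 1])
```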
Incremental training
Balanced training strategy: guarantee that each known category has the same number of samples as the new category during fine-tuning, to reduce jitter of the model.
Allometry training strategy: adopt a learning-rate decay matrix that differs between known and new categories, forcing new categories to learn much faster than known categories during fine-tuning.
Allometry training
Allometry training factor:
$$\alpha_i = \begin{cases} 0.1, & i \le N \\ 1, & i > N \end{cases}$$
Updating the weights:
$$W_{i,j} = W_{i,j} - \alpha_i \lambda \frac{\partial}{\partial W_{i,j}} J(W, b)$$
$$b_i = b_i - \alpha_i \lambda \frac{\partial}{\partial b_i} J(W, b)$$
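The allometry update amounts to a per-category learning-rate multiplier. A minimal sketch, assuming one weight row per output category with known categories stored first (indices are 0-based in the code, 1-based in the slide); function names and example values are illustrative.

```python
def allometry_factors(num_known, num_total, known_rate=0.1, new_rate=1.0):
    """alpha_i for each output category: known ones first, new ones after."""
    return [known_rate if i < num_known else new_rate for i in range(num_total)]


def sgd_step(weights, biases, grad_w, grad_b, alphas, lam):
    """Apply W_ij -= alpha_i * lam * dJ/dW_ij and b_i -= alpha_i * lam * dJ/db_i."""
    new_w = [
        [w - a * lam * g for w, g in zip(row, grad_row)]
        for row, grad_row, a in zip(weights, grad_w, alphas)
    ]
    new_b = [b - a * lam * g for b, g, a in zip(biases, grad_b, alphas)]
    return new_w, new_b


# Two known categories and one new category, all with the same gradient.
alphas = allometry_factors(num_known=2, num_total=3)
new_w, new_b = sgd_step(
    weights=[[1.0], [1.0], [1.0]], biases=[0.0, 0.0, 0.0],
    grad_w=[[1.0], [1.0], [1.0]], grad_b=[1.0, 1.0, 1.0],
    alphas=alphas, lam=0.5,
)
```

With identical gradients, the row of the new category moves ten times as far as the rows of the known categories.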
Open Deep Network
[Figure, panels ① to ④: comparison of our initialization and stochastic initialization; accuracies under different openness; testing accuracies at each iteration; accuracies of known and unknown categories at each iteration]
Open Deep Network
Experiment results
ODN vs. closed-set action recognition: 5.39 samples vs. 94.4 samples.
Comparable performance.
Current Situation of Action Recognition
High computation cost of optical flow;
High computation cost of deep learning models. 3D ConvNet is much faster, but still very heavy.
Real-time capability of optical flows
Speed and accuracy of several optical flow methods:
Optical flow  Accuracy on UCF101 split 1 (%)  Per-frame time at 340x256 (s)  FPS
TVL1          72.98                           0.181                          5.5
FB            70.58                           0.056                          17.9
S2D           65.34                           0.026                          38.5
PCA           69.6                            0.053                          18.9
DIS           63.28                           0.07                           14.3
Flownet2      57.71                           0.123(+-)                      8.1
Brox          72.35                           0.074                          13.5
LK            37.51                           0.0034                         294.1
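The fps figures follow from the per-frame times as fps = 1 / time. The short check below reproduces them; the 25 fps cut-off used to flag real-time methods is an assumption for illustration, not a number from the slides.

```python
# Per-frame times in seconds, copied from the optical flow comparison.
per_frame_seconds = {
    "TVL1": 0.181, "FB": 0.056, "S2D": 0.026, "PCA": 0.053,
    "DIS": 0.07, "Flownet2": 0.123, "Brox": 0.074, "LK": 0.0034,
}

# Frames per second is the reciprocal of the per-frame time.
fps = {name: round(1.0 / t, 1) for name, t in per_frame_seconds.items()}

# Assumed real-time threshold of 25 frames per second.
real_time = [name for name, rate in fps.items() if rate >= 25.0]
```

Under that threshold only S2D and LK qualify, and both are among the less accurate methods, which is consistent with the trend of abandoning optical flow.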
Trend: Abandon optical flow
3D CNN
Year  Model     Method                                                                     Input dimension
2012  3D CNN    Replace 2D conv with 3D conv                                               7x33x60x40
2015  C3D       Deep 3D ConvNet                                                            16x3x112x112
2016  ResNet3D  3D Resnet50                                                                80x6x80x80
2017  I3D       Inflate 2D kernels to 3D kernels and copy weights along the 3rd dimension  64x3x224x224
2017  NL I3D    Non-local operation                                                        32x8x224x224
2017  P3D       Simulate 3x3x3 kernel with 1x3x3 and 3x1x1 kernels                         16x3x160x160
Complexity:
I3D has 107.9B FLOPs.
Google uses 64 GPUs to train the I3D model.
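To see why the P3D factorization reduces cost, one can count the weights of a full 3x3x3 convolution against a 1x3x3 convolution followed by a 3x1x1 convolution. The channel count below is a hypothetical value, and keeping the same number of channels between the two factorized layers is an assumption.

```python
def conv3d_weights(kt, kh, kw, c_in, c_out):
    """Number of weights (without biases) in a 3D convolution layer."""
    return kt * kh * kw * c_in * c_out


channels = 64  # hypothetical channel count
full = conv3d_weights(3, 3, 3, channels, channels)
factorized = (conv3d_weights(1, 3, 3, channels, channels)
              + conv3d_weights(3, 1, 1, channels, channels))
ratio = factorized / full  # 12 / 27 of the full kernel's weights
```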
GPU selection?
I3D experiments based on TensorFlow
The train/test code is available at https://github.com/shiyemin/shuttleNet
Input size: 64x3x224x224
Batch Norm needs a sufficiently large batch size, which consumes a lot of GPU memory.
107.9B FLOPs: sufficient computational power is needed to reduce training time.
GPU selection?
I3D experiments based on TensorFlow
Note: We do not use the NVLink connections between GPUs.
Because we apply no further multi-GPU optimization, our results are likely slower than others'.
GPU   #GPU  Max Batch Size  Train FPS  Test FPS
K40   2     16              149.87     449.1
K80*  2     16              148.65     411.3
K80*  4     32              278.61     808.2
K80*  8     64              562.46     1649.4
K80*  16    112             1046.24    3298.8
P100  2     22              622.33     1692.5
P100  4     44              1212.01    3338
*Here, one K80 means one GPU core of a K80 card (each K80 card has two cores).
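One way to read the benchmark is to compute the multi-GPU scaling efficiency, meaning the train FPS per GPU relative to the smallest configuration of the same GPU type. The metric and the function name are illustrative choices, not something the slides define.

```python
# (GPU, number of GPUs, train FPS) rows copied from the benchmark table.
rows = [
    ("K80", 2, 148.65), ("K80", 4, 278.61), ("K80", 8, 562.46),
    ("K80", 16, 1046.24), ("P100", 2, 622.33), ("P100", 4, 1212.01),
]


def scaling_efficiency(rows, gpu):
    """Train FPS per GPU relative to the smallest configuration of that GPU type."""
    selected = sorted((n, fps) for name, n, fps in rows if name == gpu)
    base_n, base_fps = selected[0]
    return {n: (fps / n) / (base_fps / base_n) for n, fps in selected}


k80 = scaling_efficiency(rows, "K80")
p100 = scaling_efficiency(rows, "P100")
```

On these numbers, 16 K80 cores retain roughly 88% of the per-GPU throughput of the 2-core run, and 4 P100s retain roughly 97% of the 2-GPU run. The 16-core K80 run also uses a smaller batch per GPU (112 / 16 = 7 instead of 8), which may account for part of the drop.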
References
Wang L, Xiong Y, Wang Z, et al. Temporal segment networks: towards good practices for deep action recognition[C]//European Conference on Computer Vision. Springer International Publishing, 2016: 20-36.
Donahue J, Anne Hendricks L, Guadarrama S, et al. Long-term recurrent convolutional networks for visual recognition and description[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 2625-2634.
Yue-Hei Ng J, Hausknecht M, Vijayanarasimhan S, et al. Beyond short snippets: Deep networks for video classification[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 4694-4702.
Wu Z, Wang X, Jiang Y G, et al. Modeling spatial-temporal clues in a hybrid deep learning framework for video classification[C]//Proceedings of the 23rd ACM international conference on Multimedia. ACM, 2015: 461-470.
Sharma S, Kiros R, Salakhutdinov R. Action recognition using visual attention[J]. arXiv preprint arXiv:1511.04119, 2015.
Schuldt C, Laptev I, Caputo B. Recognizing human actions: a local SVM approach[C]//Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on. IEEE, 2004, 3: 32-36.
Reddy K K, Shah M. Recognizing 50 human action categories of web videos[J]. Machine Vision and Applications, 2013, 24(5): 971-981.
Siegelbaum S A, Hudspeth A J. Principles of neural science[M]. New York: McGraw-hill, 2000.
Wang L M, Qiao Y, Tang X. Motionlets: Mid-level 3d parts for human motion recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2013: 2674-2681.
Wang H, Kläser A, Schmid C, et al. Dense trajectories and motion boundary descriptors for action recognition[J]. International journal of computer vision, 2013, 103(1): 60-79.
Cai Z, Wang L, Peng X, et al. Multi-view super vector for action recognition[C]//Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2014: 596-603.
Peng X, Wang L, Wang X, et al. Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice[J]. Computer Vision and Image Understanding, 2016.
Liu L, Shao L, Rockett P. Boosted key-frame selection and correlated pyramidal motion-feature representation for human action recognition[J]. Pattern recognition, 2013, 46(7): 1810-1818.
Wang L, Qiao Y, Tang X. Action recognition with trajectory-pooled deep-convolutional descriptors[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 4305-4314.
Wang L, Qiao Y, Tang X. MoFAP: A multi-level representation for action recognition[J]. International Journal of Computer Vision, 2015: 1-18.
Varol G, Laptev I, Schmid C. Long-term Temporal Convolutions for Action Recognition[J]. arXiv preprint arXiv:1604.04494, 2016.
Wang L, Xiong Y, Wang Z, et al. Towards good practices for very deep two-stream convnets[J]. arXiv preprint arXiv:1507.02159, 2015.