Intro to deep learning with GPUs
Transcript of Intro to deep learning with GPUs
March 2016
Deep Learning on GPUs
AGENDA
What is Deep Learning?
GPUs and DL
DL in practice
Scaling up DL
What is Deep Learning?
DEEP LEARNING EVERYWHERE
INTERNET & CLOUD
Image Classification, Speech Recognition, Language Translation, Language Processing, Sentiment Analysis, Recommendation
MEDIA & ENTERTAINMENT
Video Captioning, Video Search
Real Time Translation
AUTONOMOUS MACHINES
Pedestrian Detection, Lane Tracking, Traffic Sign Recognition
SECURITY & DEFENSE
Face Detection, Video Surveillance, Satellite Imagery
MEDICINE & BIOLOGY
Cancer Cell Detection, Diabetic Grading, Drug Discovery
Traditional machine perception: hand-crafted feature extractors
Raw data → Feature extraction → Classifier/detector → Result
Classifiers/detectors: SVM, shallow neural net, HMM, clustering, LDA, LSA, …
Applications: speaker ID, speech transcription, topic classification, machine translation, sentiment analysis, …
Deep learning approach
Train: labeled examples (dog, cat, raccoon) are fed through the MODEL; errors are fed back to adjust it
Deploy: the trained MODEL labels new inputs (dog, cat, honey badger)
Artificial neural network: a collection of simple, trainable mathematical units that collectively learn complex functions
Input layer Output layer
Hidden layers
Given sufficient training data, an artificial neural network can approximate very complex functions mapping raw data to output decisions
Artificial neurons
From Stanford cs231n lecture notes
Biological neuron
Artificial neuron: inputs x1, x2, x3 with weights w1, w2, w3
y = F(w1·x1 + w2·x2 + w3·x3)
F(x) = max(0, x) (ReLU)
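A minimal NumPy sketch of the neuron above (the weight and input values are made up for illustration):

```python
import numpy as np

def relu(x):
    # F(x) = max(0, x), the activation from the slide
    return np.maximum(0.0, x)

def neuron(w, x):
    # y = F(w1*x1 + w2*x2 + w3*x3) as a dot product
    return relu(np.dot(w, x))

w = np.array([0.5, -1.0, 2.0])   # illustrative weights
x = np.array([1.0, 2.0, 0.5])    # illustrative inputs
y = neuron(w, x)                 # 0.5*1 - 1*2 + 2*0.5 = -0.5, ReLU -> 0.0
```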
Deep neural network (DNN)
Raw data → Low-level features → Mid-level features → High-level features → Result
Application components:
Task objective, e.g. identify faces
Training data: 10-100M images
Network architecture: ~10 layers, 1B parameters
Learning algorithm: ~30 exaflops, ~30 GPU-days
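As a back-of-envelope check on the slide's figures (which are rough order-of-magnitude estimates), ~30 exaflops in ~30 GPU-days implies a sustained arithmetic rate:

```python
# Back-of-envelope: sustained throughput implied by the slide's estimates.
total_flops = 30e18              # ~30 exaflops of training compute
gpu_seconds = 30 * 86400         # ~30 GPU-days in seconds
sustained = total_flops / gpu_seconds
print(f"{sustained / 1e12:.1f} TFLOP/s sustained")  # 11.6 TFLOP/s
```

That is above the 7 TFLOPS single-precision peak quoted later for one M40, so in practice such a run would be spread across several GPUs or take proportionally longer on one.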
Deep learning benefits
§ Robust
§ No need to design the features ahead of time – features are automatically learned to be optimal for the task at hand
§ Robustness to natural variations in the data is automatically learned
§ Generalizable
§ The same neural net approach can be used for many different applications and data types
§ Scalable
§ Performance improves with more data, method is massively parallelizable
Baidu Deep Speech 2
English and Mandarin speech recognition
Transition from English to Mandarin made simpler by end-to-end DL
No feature engineering or Mandarin-specifics required
More accurate than humans
Error rate 3.7% vs. 4% for human tests
http://svail.github.io/mandarin/
http://arxiv.org/abs/1512.02595
End-to-end Deep Learning for English and Mandarin Speech Recognition
AlphaGo
Training DNNs: 3 weeks, 340 million training steps on 50 GPUs
Play: Asynchronous multi-threaded search
Simulations on CPUs, policy and value DNNs in parallel on GPUs
Single machine: 40 search threads, 48 CPUs, and 8 GPUs
Distributed version: 40 search threads, 1202 CPUs and 176 GPUs
Outcome: Beat both European and World Go champions in best of 5 matches
First Computer Program to Beat a Human Go Professional
http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html
http://deepmind.com/alpha-go.html
Deep Learning for Autonomous vehicles
Deep Learning Synthesis
Texture synthesis and transfer using CNNs. Timo Aila et al., NVIDIA Research
[Chart: ImageNet classification accuracy, 0-100%, 2009-2016]
THE AI RACE IS ON
IBM Watson Achieves Breakthrough in Natural Language Processing
Facebook Launches Big Sur
Baidu Deep Speech 2 Beats Humans
Google Launches TensorFlow
Microsoft & U. Science & Tech, China Beat Humans on IQ
Toyota Invests $1B in AI Labs
ImageNet Accuracy Rate: Traditional CV vs. Deep Learning
The Big Bang in Machine Learning
“ Google’s AI engine also reflects how the world of computer hardware is changing. (It) depends on machines equipped with GPUs… And it depends on these chips more than the larger tech universe realizes.”
DNN + BIG DATA + GPU
GPUs and DL
USE MORE PROCESSORS TO GO FASTER
Deep learning development cycle
Three Kinds of Networks
DNN – all fully connected layers
CNN – some convolutional layers
RNN – recurrent neural network, LSTM
DNN: key operation is a dense M x V (matrix-vector) multiply
Backpropagation uses dense matrix-matrix multiply starting from softmax scores
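To make the M x V structure concrete, a small NumPy sketch of a fully connected forward pass (the layer widths are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Three fully connected layers; each forward step is a dense M x V product.
sizes = [784, 256, 256, 10]                    # illustrative layer widths
weights = [rng.standard_normal((m, n)) * 0.01  # W has shape (out, in)
           for n, m in zip(sizes[:-1], sizes[1:])]

def forward(x):
    for W in weights:
        x = np.maximum(0.0, W @ x)   # M x V multiply, then ReLU
    return x

y = forward(rng.standard_normal(784))   # output vector of length 10
```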
DNN: batching (for training, and for latency-insensitive inference) turns M x V into M x M
The batched operation is M x M, which gives reuse of the weights.
Without batching, each element of the weight matrix would be used only once per fetch.
Want 10-50 arithmetic operations per memory fetch for modern compute architectures.
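A rough arithmetic-intensity sketch of why batching matters; the flop and byte counts below are the standard back-of-envelope accounting (weights, inputs, and outputs each fetched or stored once), not measurements:

```python
# Arithmetic intensity (flops per byte) of a dense layer, with and
# without batching: the weight matrix is fetched once but reused
# across the whole batch.
def intensity(m, n, batch, bytes_per_elem=4):
    flops = 2 * m * n * batch  # one multiply-add per weight per sample
    data = bytes_per_elem * (m * n + n * batch + m * batch)  # W + inputs + outputs
    return flops / data

m = n = 4096
print(intensity(m, n, batch=1))    # M x V: ~0.5 flops/byte -> memory bound
print(intensity(m, n, batch=64))   # M x M: far higher weight reuse
```

With batch 1 the ratio sits near 0.5 flops per byte, far below the 10-50 operations per fetch the slide says modern architectures want; a batch of 64 pushes it well past that range.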
CNN: requires convolution and M x V
The same filters are conserved (shared) across the whole plane
Multiply-limited (compute-bound), even without batching.
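A direct 2-D sliding-window sketch of the shared-filter idea (the filter and image are toy values; like CNN frameworks, this computes cross-correlation):

```python
import numpy as np

def conv2d_valid(image, kernel):
    # The same small filter slides over every position of the plane
    # (valid mode: no padding), computing a dot product at each step.
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

edge = np.array([[1.0, -1.0]])       # tiny horizontal-gradient filter
img = np.tile([1.0, 0.0], (4, 2))    # 4x4 striped image
result = conv2d_valid(img, edge)     # each row: [1, -1, 1]
```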
Other operations to finish building a DNN
These are not limiting factors with appropriate GPU use
Complex networks have hundreds of millions of weights.
Lots of Parallelism Available in a DNN
TESLA M40: World's Fastest Accelerator for Deep Learning Training
[Chart: Caffe training time in days — GPU server with 4x Tesla M40 vs. dual-CPU server: 13x faster training]
CUDA Cores: 3072
Peak SP: 7 TFLOPS
GDDR5 Memory: 12 GB
Bandwidth: 288 GB/s
Power: 250W
Reduce Training Time from 13 Days to just 1 Day
Note: Caffe benchmark with AlexNet; CPU server uses 2x E5-2680v3 12-core 2.5GHz CPUs, 128GB system memory, Ubuntu 14.04
28 Gflop/W
Comparing CPU and GPU, server class: Xeon E5-2698 and Tesla M40
NVIDIA whitepaper, "GPU-Based Deep Learning Inference: A Performance and Power Analysis."
DL in practice
NVIDIA GPU PLATFORM – The Engine of Modern AI
Frameworks: TensorFlow, CNTK, Torch, Theano, Caffe, MatConvNet, Purine, Mocha.jl, Minerva, MXNet*, Chainer, DL4J, OpenDeep, Keras
Systems: Big Sur, Watson
Adopters: education, start-ups, laboratories (Vitruvian, Schults)
*U. Washington, CMU, Stanford, TuSimple, NYU, Microsoft, U. Alberta, MIT, NYU Shanghai
CUDA for Deep Learning Development
TITAN X, DEVBOX, GPU CLOUD
DEEP LEARNING SDK
DIGITS, cuDNN, cuBLAS, cuSPARSE, NCCL
§ GPU-accelerated Deep Learning subroutines
§ High performance neural network training
§ Accelerates Major Deep Learning frameworks: Caffe, Theano, Torch, TensorFlow
§ Up to 3.5x faster AlexNet training in Caffe than baseline GPU
Millions of Images Trained Per Day
Tiled FFT up to 2x faster than FFT
developer.nvidia.com/cudnn
[Chart: millions of images trained per day and relative speedup (0x-2.5x), cuDNN 1 through cuDNN 4]
Deep Learning Primitives – Accelerating Artificial Intelligence
CUDA BOOSTS DEEP LEARNING: 5X IN 2 YEARS (performance)
AlexNet training throughput based on 20 iterations; CPU: 1x E5-2680v3 12-core 2.5GHz, 128GB system memory, Ubuntu 14.04
Caffe Performance
[Chart: K40 (11/2013), K40+cuDNN1 (9/2014), M40+cuDNN3 (7/2015), M40+cuDNN4 (12/2015)]
NVIDIA DIGITS: Interactive Deep Learning GPU Training System
Process Data, Configure DNN, Monitor Progress, Visualize Layers, Test Image
developer.nvidia.com/digits
ONE ARCHITECTURE — END-TO-END AI
Titan X for PC gaming
DRIVE PX for auto
Tesla for cloud
Jetson for embedded
Scaling DL
Scaling Neural Networks: Data Parallelism
Each machine holds a full copy of the weights W: Machine 1 trains on Image 1, Machine 2 on Image 2, then the models sync.
Notes:
Need to sync the model across machines.
The largest models do not fit on one GPU.
Requires a P-fold larger batch size.
Works across many nodes (parameter-server approach) with linear speedup.
Adam Coates, Brody Huval, Tao Wang, David J. Wu, Andrew Ng and Bryan Catanzaro
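The sync step above can be sketched as gradient averaging across workers (the quadratic loss and values are toy stand-ins for a real training step):

```python
import numpy as np

# Data parallelism: each worker holds a full copy of the weights W,
# computes gradients on its own shard of the batch, then the gradients
# are averaged (the "sync" step) so every copy stays identical.
def sgd_data_parallel(W, shards, grad_fn, lr=0.1):
    grads = [grad_fn(W, shard) for shard in shards]  # in parallel per machine
    mean_grad = np.mean(grads, axis=0)               # all-reduce / parameter server
    return W - lr * mean_grad

# Toy objective: minimize ||W - x||^2 over data points x.
grad_fn = lambda W, x: 2 * (W - x)
W = np.zeros(3)
shards = [np.ones(3), 3 * np.ones(3)]                # two machines, one point each
W = sgd_data_parallel(W, shards, grad_fn)            # one synchronized step
```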
Multiple GPUs: near-linear scaling, data parallel.
Ren Wu et al., Baidu, "Deep Image: Scaling up Image Recognition," arXiv 2015
Scaling Neural Networks: Model Parallelism
The model (weights W) is split across machines (Machine 1, Machine 2), which both process the same image (Image 1).
Notes:
Allows for larger models than fit on one GPU.
Requires much more frequent communication between GPUs.
Most commonly used within a node (GPU P2P).
Effective for the fully connected layers.
Adam Coates, Brody Huval, Tao Wang, David J. Wu, Andrew Ng and Bryan Catanzaro
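A minimal sketch of splitting one fully connected layer's weights across two devices (two array slices stand in for GPUs here; the layer sizes are arbitrary):

```python
import numpy as np

# Model parallelism: the layer's weight matrix is split by output rows
# across two "devices"; each computes its slice of the output, and the
# slices are concatenated (the frequent communication the slide mentions).
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))   # full layer weights
x = rng.standard_normal(4)        # input, broadcast to both devices

W_dev0, W_dev1 = W[:4], W[4:]     # each device holds half the model
y_parallel = np.concatenate([W_dev0 @ x, W_dev1 @ x])

assert np.allclose(y_parallel, W @ x)   # identical to the single-device result
```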
Scaling Neural Networks: Hyper-Parameter Parallelism
Try many alternative neural networks in parallel, on different CPUs / GPUs / machines. Probably the most obvious and effective way!
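Because each candidate run is fully independent, hyper-parameter search parallelizes with no inter-run communication; a toy sketch (the scoring function is a stand-in for a full training run, and a real search would launch one job per GPU or machine):

```python
from concurrent.futures import ThreadPoolExecutor

def train_and_score(lr):
    # Stand-in for a full training run returning a validation score;
    # here the score simply peaks at lr = 0.1.
    return -abs(lr - 0.1)

candidates = [0.001, 0.01, 0.1, 1.0]     # hyper-parameter settings to try
with ThreadPoolExecutor() as pool:
    scores = list(pool.map(train_and_score, candidates))  # all runs independent
best = candidates[scores.index(max(scores))]
```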
Deep Learning Everywhere
NVIDIA Titan X
NVIDIA Jetson
NVIDIA Tesla
NVIDIA DRIVE PX
Contact: [email protected]