Deep Learning on GPU Clusters

Transcript of the talk by Adam Coates, Brody Huval, Tao Wang, David J. Wu, Andrew Y. Ng and Bryan Catanzaro (on-demand.gputechconf.com)

Page 1

Deep Learning on GPU Clusters

Bryan Catanzaro

Page 2


Machine Learning

• ML runs many things these days

– Ad placement / Product Recommendations

– Web search / image search

– Speech recognition / machine translation

– Autonomous driving

– …


Page 3

Machine learning in practice

[Diagram: input → Feature Extraction (built from prior knowledge and experience) → Machine Learning (Classifier) → “Mug”.]

Page 4

How does ML work?

[Diagram: Pixels → Feature Extraction → Classifier → “Mug”?, with a feedback arrow that updates the model after each prediction.]

Page 5

Learning features

• Can we learn “features” from data?

[Diagram: Pixels → Feature Extraction → Classifier → “Mug”?]

Update the entire stack to make better predictions.

AKA: “deep learning.”

Page 6

Learning features

• Deep learning: learn multiple stages of features to achieve the end goal.

[Diagram: Pixels → Features → Features → Classifier → “Mug”?]
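To make “multiple stages of features” concrete, here is a toy NumPy sketch (my illustration, not from the talk; the sizes and random weights are stand-ins): pixels flow through two learned feature stages, then a linear classifier produces a “mug” score.

    # Toy illustration of stacked feature learning: Pixels -> Features ->
    # Features -> Classifier. Real systems train all three stages end to
    # end; here the weights are random stand-ins.
    import numpy as np

    def relu(x):
        return np.maximum(x, 0.0)

    pixels = np.random.rand(784).astype(np.float32)    # e.g. a 28x28 image
    W1 = np.random.randn(256, 784).astype(np.float32)  # stage-1 features
    W2 = np.random.randn(64, 256).astype(np.float32)   # stage-2 features
    w_clf = np.random.randn(64).astype(np.float32)     # linear classifier

    features1 = relu(W1 @ pixels)     # first stage of learned features
    features2 = relu(W2 @ features1)  # second stage, built on the first
    score = float(w_clf @ features2)  # "Mug"? score
    print("mug score:", score)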

Page 7

Progress in AI

[Diagram: the iteration loop: Idea → Code → Train → Test → back to Idea.]

• Deep learning brings special opportunities and challenges.

• Iteration latency gates progress.

• Goal: the fastest time from idea to tested model.

Page 8


The need for scaling

• DistBelief [Le et al., ICML 2012]: up to 1.7-billion-parameter networks.

• An unsupervised learning algorithm with >1 billion parameters was able to discover “objects” in high-resolution images.

• 1,000 machines for 1 week (16,000 cores).

[Images: visualizations of neurons that respond to faces, bodies, and cats.]

[Also: Dean et al., NIPS 2012]

Page 9


What will the rest of us do??

• Millions of dollars of hardware.

• Extensive engineering to handle node failures and network traffic.

• Hard to scale beyond the data center.

– …if you had a data center.

Page 10


Two ways to scale neural networks

• Simple solution: “data parallelism”

– Parallelize over images in a batch.

Need to synchronize the model across machines.

Difficult to fit big models on GPUs.

[Diagram: Machine 1 processes Image 1 and Machine 2 processes Image 2, each holding a full copy of the model; a sync step reconciles the copies.]
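The sync step is a gradient average. A minimal sketch, assuming mpi4py and NumPy (the talk shows no code; names and sizes here are mine):

    # Data parallelism: each worker computes a gradient on its own images,
    # then an all-reduce averages the gradients so every model replica
    # applies the same update. Run with: mpirun -np <workers> python sync.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    nworkers = comm.Get_size()

    n_params = 1_000_000  # stand-in; the slide's example is 1B parameters
    params = np.zeros(n_params, dtype=np.float32)
    grad = np.random.randn(n_params).astype(np.float32)  # local gradient

    # Sum gradients across workers in place, then average. For a
    # 1B-parameter float32 model this moves ~4 GB per worker per step,
    # which is what makes Ethernet clusters slow.
    comm.Allreduce(MPI.IN_PLACE, grad, op=MPI.SUM)
    grad /= nworkers

    params -= 0.01 * grad  # identical update on every replica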

Page 11


Two ways to scale neural networks

• “Model parallelism”

– Parallelize over neurons. (Relies on local connectivity.)

Scales to much larger models. Much more frequent synchronization.

[Diagram: a single image processed jointly by Machine 1 and Machine 2, each holding part of the model’s neurons.]
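As a toy illustration of the idea (mine, not the talk’s), consider one fully connected layer split over workers with mpi4py: each worker owns a slice of the layer’s neurons and computes only their responses. With local connectivity only boundary responses would need to move; this dense version gathers everything just to show the shapes.

    # Model parallelism on a single image: each worker holds a slice of
    # the weight matrix (a subset of the layer's neurons).
    # Run with a worker count that divides n_out, e.g. mpirun -np 4 ...
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    size = comm.Get_size()

    n_in, n_out = 4096, 4096
    x = np.ones(n_in, dtype=np.float32)  # the same image on every worker
    W_slice = np.random.randn(n_out // size, n_in).astype(np.float32)

    local_out = W_slice @ x  # responses of this worker's neurons only

    # Gathering the slices is the frequent synchronization the slide
    # warns about: it happens at every layer, not once per batch.
    full_out = np.empty(n_out, dtype=np.float32)
    comm.Allgather(local_out, full_out)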

Page 12


The network bottleneck

• On a typical Ethernet cluster:

– Data parallelism:

• Synchronizing a 1B-parameter model = 30 seconds.

– Model parallelism:

• Moving 1 MB of neuron responses for 100 images = 0.8 seconds.

– Must do this for every layer.

– Typically >>10 times slower than the computation.

• Problem: communication makes distribution very inefficient for large neural nets.

– How do we scale out efficiently?
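These numbers are easy to sanity-check. A back-of-the-envelope sketch, assuming float32 parameters and roughly 1 Gbps of effective Ethernet bandwidth (my assumptions, not stated on the slide):

    # Rough check of the slide's communication costs on gigabit Ethernet.
    BYTES_PER_PARAM = 4   # float32
    ETHERNET_BPS = 1e9    # ~1 Gbps effective bandwidth (assumption)

    # Data parallelism: ship a full 1B-parameter model.
    model_bytes = 1e9 * BYTES_PER_PARAM           # 4 GB
    print(model_bytes * 8 / ETHERNET_BPS, "s")    # ~32 s, i.e. the ~30 s

    # Model parallelism: 1 MB of neuron responses x 100 images, per layer.
    act_bytes = 1e6 * 100                         # 100 MB
    print(act_bytes * 8 / ETHERNET_BPS, "s")      # 0.8 s per layer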

Page 13


COTS HPC Hardware

• InfiniBand:

– FDR InfiniBand switch.

– 1 network adapter per server.

– 56 Gbps; microsecond latency.

• GTX 680 GPUs:

– 4 GPUs per server.

– >1 TFLOPS each for an ideal workload.

Page 14


Model parallelism in MPI

• MPI starts a single process for each GPU.

– Enables message passing, but this is surprisingly unnatural.

[Diagram: one image processed across GPU 1 and GPU 2, which hold the weight partitions W1 and W2.]
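A minimal sketch of the one-process-per-GPU pattern, assuming mpi4py and two ranks that each own half of a layer’s neurons (shapes and names are mine): before the next layer can run, neighbors exchange the boundary responses the other side needs.

    # One MPI process per GPU; each rank owns half the neurons of a layer.
    # Run with exactly two ranks: mpirun -np 2 python halo.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()   # 0 or 1
    neighbor = 1 - rank

    halo = 1024  # made-up size of the shared boundary region
    local_acts = np.random.randn(65536).astype(np.float32)

    # Send my edge of the activations and receive the neighbor's edge in
    # one deadlock-free call.
    send_halo = local_acts[-halo:] if rank == 0 else local_acts[:halo]
    recv_halo = np.empty(halo, dtype=np.float32)
    comm.Sendrecv(send_halo, dest=neighbor, recvbuf=recv_halo, source=neighbor)

    # Next-layer input = my activations plus the received boundary.
    if rank == 0:
        inputs = np.concatenate([local_acts, recv_halo])
    else:
        inputs = np.concatenate([recv_halo, local_acts])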


Page 16


HPC Software Infrastructure: Communication

• Moving neuron responses around is confusing.

– Hide the communication inside a “distributed array”.

[Diagram: a distributed array tiled across GPU 1, GPU 2, GPU 3, and GPU 4.]

Page 17


HPC Software Infrastructure: Communication

• After some hidden communication, GPU 2 has all the input data it needs.

– The GPU code is not much different from the 1-GPU case.

[Diagram: GPU 2’s tile, now padded with the neighbor data it received.]
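The transcript does not show the actual interface, but a minimal sketch of the idea in mpi4py might look like this (the class and method names are my invention): the array hides the neighbor exchanges so per-GPU code can treat its padded tile like local data.

    # Hypothetical "distributed array": each rank owns one tile, and
    # gather_with_halo() fetches neighbor edges behind the scenes.
    import numpy as np
    from mpi4py import MPI

    class DistributedArray:
        def __init__(self, comm, tile, halo):
            self.comm, self.tile, self.halo = comm, tile, halo

        def gather_with_halo(self):
            """Return this rank's tile padded with neighbors' edges."""
            rank, size = self.comm.Get_rank(), self.comm.Get_size()
            h = self.halo
            left = np.zeros(h, dtype=self.tile.dtype)   # zeros at edges
            right = np.zeros(h, dtype=self.tile.dtype)
            if rank > 0:            # exchange with the left neighbor
                self.comm.Sendrecv(self.tile[:h], dest=rank - 1,
                                   recvbuf=left, source=rank - 1)
            if rank < size - 1:     # exchange with the right neighbor
                self.comm.Sendrecv(self.tile[-h:], dest=rank + 1,
                                   recvbuf=right, source=rank + 1)
            return np.concatenate([left, self.tile, right])

    # Per-GPU code now looks almost like the 1-GPU version.
    comm = MPI.COMM_WORLD
    arr = DistributedArray(comm, np.random.randn(4096).astype(np.float32), 8)
    padded = arr.gather_with_halo()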

Page 18


Results: Scaling

• Implemented the 9-layer network from Le et al., 2012.

– 3 stacks of sparse autoencoder, pooling, and LCN.

– Compute the “fine-tuning” gradient.

• Up to 11.2B-parameter networks.

– Update time similar to a 185M-parameter network on 1 GPU.

[Figure: iteration time (s), from 0 to 1, vs. number of GPUs (1, 4, 9, 16, 36, 64) for networks of 185M, 680M, 1.9B, 3.0B, 6.9B, and 11.2B parameters.]

Page 19


Results: Scaling

• Up to 47x increase in throughput:

[Figure: speedup factor (1 to 64) vs. number of GPUs (1, 4, 9, 16, 36, 64) for the same networks (185M to 11.2B parameters), with a linear-scaling reference line.]

Page 20

cuDNN

• Deep neural networks rely heavily on BLAS (Basic Linear Algebra Subprograms).

• However, some kernels are unique to DNNs, such as convolutions.

• cuDNN is a GPU library that provides these kernels

• Available at https://developer.nvidia.com/cudnn
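To see why convolutions still end up leaning on BLAS, here is a minimal NumPy sketch of the classic im2col-plus-GEMM lowering. This is illustrative only, not cuDNN’s actual algorithm, but it shows the kind of kernel cuDNN provides tuned GPU versions of.

    # Lower a DNN convolution onto a single matrix multiply (a GEMM).
    import numpy as np

    def conv2d_im2col(x, w):
        """x: (C, H, W) input; w: (K, C, R, S) filters; 'valid' conv."""
        C, H, W = x.shape
        K, _, R, S = w.shape
        out_h, out_w = H - R + 1, W - S + 1
        # im2col: unroll every receptive field into one column.
        cols = np.empty((C * R * S, out_h * out_w), dtype=x.dtype)
        for i in range(out_h):
            for j in range(out_w):
                cols[:, i * out_w + j] = x[:, i:i + R, j:j + S].ravel()
        # The whole convolution is now one BLAS call.
        return (w.reshape(K, -1) @ cols).reshape(K, out_h, out_w)

    x = np.random.randn(3, 8, 8).astype(np.float32)     # 3-channel input
    w = np.random.randn(4, 3, 3, 3).astype(np.float32)  # 4 filters
    print(conv2d_im2col(x, w).shape)  # (4, 6, 6)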


Page 21

Using Caffe with cuDNN

• Accelerates Caffe layer types by 1.2–3x.

• On average, 36% faster overall for training AlexNet.

• Integrated into the Caffe dev branch.

– Official release soon.

Overall AlexNet training time, baseline Caffe vs. Caffe accelerated by cuDNN on a K40:

    Caffe (CPU*)    1x
    Caffe (GPU)    11x
    Caffe (cuDNN)  14x

*CPU: 24-core E5-2697v2 @ 2.4 GHz with Intel MKL 11.1.3.

Page 22

Conclusion

• Deep Learning is increasingly important to AI

• HPC is key to Deep Learning

• Interested in applying your HPC skills to AI? Talk to us!

[email protected]