GPUs for Online Deep Learning Applications · CPU vs GPU CPU (Intel E5-2660 v3) GPU (Nvidia K1200*)...


GPUs for Online Deep Learning Applications

Chris Fougner

Content

• Deploying a streaming speech recognition service

• GPU deployments within Baidu

Missing Content

• Songbai Pu

• FPGA vs. GPU discussion

Speech Recognition

你好 ("Hello")

Breaking down an utterance

• "Take me to Philz Coffee on Middlefield"

Breaking down an utterance

• Recent state-of-the-art neural networks for speech recognition have ~200M parameters

• This translates to ~50B FLOPs for a 2.53 s utterance, or ~20 GFLOPs per second of audio

• Users want a response in ~100ms
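
The per-second figure follows directly from the two numbers on the slide. A minimal sketch of that arithmetic, with illustrative names that are not from the talk, and with the concurrent-stream estimate added as an assumption (every stream is taken to cost the same as the example utterance):

```python
# Figures quoted on the slide; the constant and function names are illustrative.
UTTERANCE_FLOPS = 50e9     # ~50B FLOPs for the example utterance
UTTERANCE_SECONDS = 2.53   # length of the example utterance in seconds


def flops_per_audio_second(total_flops: float, seconds: float) -> float:
    """Compute cost of one second of audio, in FLOPs."""
    return total_flops / seconds


def required_gflops(concurrent_streams: int) -> float:
    """Sustained GFLOP/s needed to keep up with N real-time streams.

    Assumption: every stream costs the same as the example utterance.
    """
    rate = flops_per_audio_second(UTTERANCE_FLOPS, UTTERANCE_SECONDS)
    return concurrent_streams * rate / 1e9


rate = flops_per_audio_second(UTTERANCE_FLOPS, UTTERANCE_SECONDS)
print(f"{rate / 1e9:.1f} GFLOP per second of audio")  # prints 19.8
print(f"{required_gflops(10):.0f} GFLOP/s for 10 concurrent streams")
```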

50B FLOPs in a datacenter

• Doesn't smell like a typical datacenter application

Typical datacenter model

• A typical setup yields 2-3 concurrent users

Borrow tricks from training

• Use GPUs?

• Batch utterances?

CPU vs GPU

                         CPU (Intel E5-2660 v3)   GPU (Nvidia K1200*)
TDP                      105 W                    45 W
Price                    $1500 USD                $300 USD
Peak FMA FLOPs           0.4 TFLOPs               1.1 TFLOPs
Memory Bandwidth         68 GB/s                  80 GB/s
Max Units / Server       2                        4-8
Float 16-bit libraries   No                       Yes

*or Tesla M4 shortly.
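
One way to read the table is in terms of peak compute per watt and per dollar. The sketch below derives those ratios from the quoted figures. The `Device` class and helper names are illustrative, and the results are ratios of peak numbers only, not measured throughput.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Illustrative container for one column of the comparison table."""
    name: str
    tdp_watts: float
    price_usd: float
    peak_tflops: float


# Values copied from the table above.
CPU = Device("Intel E5-2660 v3", tdp_watts=105, price_usd=1500, peak_tflops=0.4)
GPU = Device("Nvidia K1200", tdp_watts=45, price_usd=300, peak_tflops=1.1)


def gflops_per_watt(d: Device) -> float:
    return d.peak_tflops * 1000 / d.tdp_watts


def gflops_per_dollar(d: Device) -> float:
    return d.peak_tflops * 1000 / d.price_usd


for d in (CPU, GPU):
    print(f"{d.name}: {gflops_per_watt(d):.1f} GFLOPs/W, "
          f"{gflops_per_dollar(d):.2f} GFLOPs/USD")

# Peak-only advantage of the GPU over the CPU.
print(f"per watt:   {gflops_per_watt(GPU) / gflops_per_watt(CPU):.1f}x")
print(f"per dollar: {gflops_per_dollar(GPU) / gflops_per_dollar(CPU):.2f}x")
```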

K1200 GPU server

• Naive approach: directly replacing the CPU with a GPU yields 2x users per server

[Bar chart: users per server (axis 0 to 8), E5-2660 v3 vs. K1200]

Batching

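[Diagram: a separate matrix-vector product W × h per utterance vs. a single matrix-matrix product W × H over a batch of utterances]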
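
Batching replaces one matrix-vector product W·h per user with a single matrix-matrix product W·H, where each column of H holds one user's vector. A minimal pure-Python sketch of that equivalence follows. The helper names and the toy weights are illustrative, and the "FLOPs per weight read" figure is a simplified model that assumes W is read from memory once per product.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(W: Matrix, h: Vector) -> Vector:
    """W (m x k) times h (k): one user's activations."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def matmat(W: Matrix, H: Matrix) -> Matrix:
    """W (m x k) times H (k x b): H holds one user's vector per column."""
    cols = list(zip(*H))
    return [[sum(w * x for w, x in zip(row, col)) for col in cols] for row in W]


def flops_per_weight_read(batch_size: int) -> int:
    """Simplified model: each weight is read once and used for one
    multiply-add (2 FLOPs) per column in the batch."""
    return 2 * batch_size


W = [[1.0, 2.0], [3.0, 4.0]]                     # toy weight matrix
h_users = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three users' input vectors

# Unbatched: W is traversed once per user.
separate = [matvec(W, h) for h in h_users]

# Batched: stack the vectors as columns of H and traverse W once.
H = [list(col) for col in zip(*h_users)]
batched = matmat(W, H)

# Column j of the batched result equals user j's unbatched result.
assert [list(col) for col in zip(*batched)] == separate
print(flops_per_weight_read(1), flops_per_weight_read(len(h_users)))
```

Under this model the arithmetic done per weight loaded grows linearly with the batch size, which is why batching helps when memory bandwidth rather than peak FLOPs is the limit.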

Is batching feasible?

[Figure: user requests plotted over time (axes: Time, User)]

Batch Dispatch
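
The transcript does not preserve how the scheduler works, so the following is only a sketch of one plausible policy, not the deployed implementation. It assumes an eager scheme: whenever the device becomes idle, every request that has queued up so far is dispatched as a single batch. The cost model, the `max_batch` limit, and all names are hypothetical.

```python
from typing import List, Tuple


def batch_time_ms(batch_size: int) -> float:
    """Hypothetical cost model: fixed overhead plus a small per-request cost."""
    return 10.0 + 1.0 * batch_size


def eager_dispatch(arrivals_ms: List[float],
                   max_batch: int = 10) -> List[Tuple[float, List[int]]]:
    """Simulate one device serving requests with eager batching.

    arrivals_ms must be sorted. Returns (start time in ms, request indices)
    for every batch that was dispatched.
    """
    batches = []
    device_free_at = 0.0
    next_req = 0
    n = len(arrivals_ms)
    while next_req < n:
        # Start as soon as the device is idle and a request is waiting.
        start = max(device_free_at, arrivals_ms[next_req])
        batch = []
        # Take everything that has arrived by the start time, up to max_batch.
        while (next_req < n and arrivals_ms[next_req] <= start
               and len(batch) < max_batch):
            batch.append(next_req)
            next_req += 1
        batches.append((start, batch))
        device_free_at = start + batch_time_ms(len(batch))
    return batches


arrivals = [0.0, 2.0, 5.0, 9.0, 30.0]
for start, batch in eager_dispatch(arrivals):
    print(f"t={start:5.1f} ms  batch={batch}")
```

With these toy arrivals the first request runs alone, the three requests that arrive while the device is busy are grouped into one batch, and the late request again runs alone. Batch size therefore adapts to load without holding requests back to fill a batch.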

GPU + Batch Dispatch

• GPU + Batch Dispatch gives 10x the throughput of the naive CPU deployment.

[Bar chart: users per server (axis 0 to 35), E5-2660 v3 vs. K1200 vs. K1200 + Batch Dispatch]

Impact on latency?

[Line chart: 98% latency in ms (axis 0 to 300) vs. concurrent users (axis 0 to 30), Single Batch vs. Batch Dispatch]

Borrow code from training?

• Bonus: highly optimized code is shared between research and production, which is a huge productivity boost

• E.g. switching from LSTM to GRU models in production code took < 1 day

Baidu GPU deployment

• Image classification

• Machine translation

Image classification

• Feature extraction for image classification uses neural networks

[Bar charts: queries per second (axis 0 to 150) and latency in ms (axis 0 to 200), E5-2620 v2 vs. K1200]

Machine translation

• translate.baidu.com uses a neural network

[Bar charts: queries per second (axis 0 to 8) and latency in ms (axis 0 to 400), E5-2620 v2 vs. K1200]

Conclusions

• GPUs are an efficient way to boost performance and decrease latency of floating point intensive tasks in production

• Use Batch Dispatch to increase throughput

• Using GPUs for neural network applications lets you share code between research and production

Mention

• Songbai Pu and Zhiqian Wang from Baidu China

Thank you!

• Questions?