Accelerating AI in Datacenters: Xilinx ML Suite
Author: Jeffrey Myers · Created: 12/18/2018 · 23 pages
© Copyright 2018 Xilinx

Page 1: Accelerating AI in Datacenters – Xilinx ML Suite

Kamran Khan
Sr. Product Manager, AI and Machine Learning

Page 2: Deep Learning Models – A Broad Spectrum

Convolutional Neural Network
• Feature Extraction
• Object Detection
• Image Segmentation

Recurrent Neural Network
• Sequence and Temporal Data
• Speech to Text
• Language Translation

Multi-Layer Perceptron
• Classification
• Universal Function Approximator
• Autoencoder

[Figure: example images for Classification (“Dog”), Object Detection, and Segmentation]

Page 3: Xilinx ML Suite – DNN Processor + “Compiler, Quantizer, Runtime”

[Figure: xDNN processor block diagram — execution controller, spill/restore DMA controller, weights DMA controller, a grid of multiply-accumulate units with distributed weight buffers, bias/ReLU and pooling stages, pooling/EWA unit, image queue, instruction buffer, and cross bar]

Trained Model → Compiler + Runtime → Xilinx DNN Processor

• 60–80% efficiency
• Low latency, high throughput at batch 1
• No FPGA expertise required

Page 4: Xilinx DNN Processor (xDNN)

˃ Configurable Overlay Processor
˃ DNN-Specific Instruction Set: Convolution, Max Pool, etc.
˃ Any Network, Any Image Size
˃ High Frequency & High Compute Efficiency
˃ Compile and run new networks

[Figure: xDNN block diagram — execution controller, spill/restore DMA controller, weights DMA controller, systolic array, bias/ReLU and pooling stages, image queue, instruction buffer, cross bar, and pooling/EWA unit]

Page 5: Rapid Feature and Performance Improvement

xDNN-v1 (Q4 CY17)
• Array of accumulators
• Int16 (Batch=1) and Int8 (Batch=2) support
• Instructions: Convolution, ReLU, Pool, Elementwise
• Flexible (square) kernel sizes and strides
• 500 MHz

xDNN-v2 (Q2 CY18)
• All xDNN-v1 features
• DDR caching: larger image sizes
• New instructions: depth-wise convolution, de-convolution, up-sampling
• Rectangular kernels
• 500 MHz

xDNN-v3 (Q4 CY18)
• New systolic array implementation: 2.2x lower latency
• Instruction-level parallelism – non-blocking data movement
• Batch=1 for Int8 – lower latency
• Feature-compatible with xDNN-v2
• 720+ MHz

Page 6: xDNN – Channel Parallel Systolic Array

[Figure: 2D channel-parallel datapath with distributed weight buffers — a grid of multiply-accumulate (×, +) units, each with a local weight buffer, feeding per-column accumulators (∑)]

Micro-architecture optimized for the underlying UltraScale+ FPGA fabric.
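The figure's structure can be sketched in plain Python — this is an illustrative model of a channel-parallel multiply-accumulate grid, not the xDNN RTL: each column of the array holds one output channel's weights in a local buffer, input channels stream across the rows, and a per-column accumulator (the ∑ units) reduces the partial products.

```python
def channel_parallel_mac(pixel, weights):
    """pixel: list of input-channel values at one spatial position.
    weights: weights[out_ch][in_ch], held in distributed buffers.
    Returns one accumulated output value per column (output channel)."""
    out = []
    for col_weights in weights:               # one column per output channel
        acc = 0                               # column accumulator (the sigma unit)
        for x, w in zip(pixel, col_weights):  # input channels stream across rows
            acc += x * w                      # local multiply-accumulate
        out.append(acc)
    return out

# e.g. 2 input channels, 3 output channels:
result = channel_parallel_mac([1, 2], [[1, 0], [0, 1], [1, 1]])  # → [1, 2, 3]
```

In hardware all columns compute concurrently; the Python loop only shows the dataflow.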

Page 7: xDNN Processor – Tensor Memory

[Figure: tensor memory with multiple write (WR) ports, with read/write paths to and from the systolic array, the DDR DMA, and the parallel-processing units]

Channel parallel, concurrent access.

Page 8: xDNN Compiler + Runtime

https://github.com/Xilinx/ml-suite

[Figure: toolflow — deep learning frameworks (e.g., MxNet) → Frontend → framework tensor graph to Xilinx tensor graph → tensor graph optimization → Compiler and Quantizer (consuming model weights and a calibration set) → Runtime executing CPU layers and FPGA layers on the input image]
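The quantizer in this flow derives INT8 scales from a calibration set. A minimal sketch of that idea, assuming symmetric per-tensor scaling — the real ml-suite quantizer's method and API may differ:

```python
def calibrate_scale(calibration_values, num_bits=8):
    """Pick a scale so the observed dynamic range maps onto the int range."""
    max_abs = max(abs(v) for v in calibration_values)
    int_max = 2 ** (num_bits - 1) - 1        # 127 for INT8
    return max_abs / int_max if max_abs else 1.0

def quantize(x, scale):
    """Quantize one float to INT8 using the calibrated scale."""
    q = round(x / scale)
    return max(-128, min(127, q))            # clamp to the INT8 range

# Calibration pass over sample activations, then quantization:
scale = calibrate_scale([-0.5, 0.25, 1.0])   # max |v| = 1.0 → scale = 1/127
q = quantize(1.0, scale)                     # → 127
```

Values outside the calibrated range saturate at ±127/−128, which is why a representative calibration set matters.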

Page 9: Graph Partitioning

˃ One-time loading of sub-graph instructions
˃ Data-flow execution

[Figure: partitioned pipeline — Pre-Processing (FPGA or CPU) → Subgraph 1 (FPGA) → Parallel Subgraphs (FPGA) → Post-Processing (CPU)]

1 Core → Multi-Core → Multi-Chip
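The partitioning idea can be sketched as grouping runs of FPGA-supported operations into subgraphs (whose instructions are loaded once) and leaving the rest on the CPU. Illustrative only — the real compiler partitions a tensor graph, not a flat op list, and the supported-op set here is assumed:

```python
FPGA_OPS = {"conv", "relu", "pool", "eltwise"}   # assumed supported set

def partition(ops):
    """Split a linear op sequence into maximal same-target subgraphs."""
    parts = []
    for op in ops:
        target = "fpga" if op in FPGA_OPS else "cpu"
        if parts and parts[-1][0] == target:
            parts[-1][1].append(op)              # extend the current subgraph
        else:
            parts.append((target, [op]))         # start a new subgraph
    return parts

parts = partition(["decode", "conv", "relu", "softmax"])
# → [('cpu', ['decode']), ('fpga', ['conv', 'relu']), ('cpu', ['softmax'])]
```

Fewer, larger FPGA subgraphs mean fewer instruction loads and CPU↔FPGA handoffs.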

Page 10: Fusing of Operations

˃ Fuse operations as pipelined stages
˃ Access activations only once

[Figure: Conv → Bias → Batch Norm → ReLU → Max Pool chained as pipelined operators (in-line, no RAM access)]
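The payoff of fusion is that each activation is read once and flows through all fused stages in-line, rather than one full pass (and RAM round-trip) per operator. A scalar sketch of the fused bias → batch-norm → ReLU stages, assuming batch norm folded into a scale and shift:

```python
def fused_stage(x, bias, bn_scale, bn_shift):
    """Bias add, folded batch norm, and ReLU applied in one pass."""
    x = x + bias                      # bias add
    x = x * bn_scale + bn_shift       # batch norm folded to scale/shift
    return max(0.0, x)                # ReLU

def run_fused(activations, bias, bn_scale, bn_shift):
    # Single traversal: no intermediate buffers between operators.
    return [fused_stage(a, bias, bn_scale, bn_shift) for a in activations]

out = run_fused([-1.0, 2.0], bias=0.0, bn_scale=1.0, bn_shift=0.0)  # → [0.0, 2.0]
```

The unfused version would materialize a full intermediate tensor after each operator; fusion eliminates those reads and writes entirely.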

Page 11: Instruction Level Parallelism

[Figure: GoogLeNet Inception module — the previous layer's output feeds four branches (1x1 Conv; 3x3 Conv Reduce → 3x3 Conv; 5x5 Conv Reduce → 5x5 Conv; 3x3 Max Pool → 1x1 Conv), which merge into a concatenated output. A timeline shows the independent branches executing in parallel.]
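The slide's point is that the four Inception branches have no mutual dependencies, so their instructions can issue in parallel while data movement is non-blocking. A toy list-scheduler over the branch DAG makes the waves explicit (op names here are our labels for the figure, e.g. "pool_proj" for the 1x1 Conv after the max pool):

```python
deps = {                                  # op -> ops it must wait for
    "1x1_conv": set(),
    "3x3_reduce": set(), "3x3_conv": {"3x3_reduce"},
    "5x5_reduce": set(), "5x5_conv": {"5x5_reduce"},
    "max_pool": set(),   "pool_proj": {"max_pool"},
    "concat": {"1x1_conv", "3x3_conv", "5x5_conv", "pool_proj"},
}

def schedule(deps):
    """Group ops into waves; everything within a wave can run in parallel."""
    done, waves = set(), []
    while len(done) < len(deps):
        ready = sorted(op for op, d in deps.items()
                       if op not in done and d <= done)
        waves.append(ready)
        done |= set(ready)
    return waves

# schedule(deps) yields three waves, with "concat" alone in the last one.
```

A sequential machine would run the branches one after another; issuing each wave concurrently is what hides the latency of the shorter branches.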

Page 12: Automatic Intra-Layer Tiling

˃ Tile when the feature map size exceeds on-chip memory
˃ Work on the full feature-map depth

[Figure: a feature map split into four spatial tiles (0–3) — any feature map size]
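A minimal sketch of the tiling step: split the spatial extent into tiles that fit on-chip while keeping the full channel depth per tile, matching the 2x2 tiling (tiles 0–3) pictured above. Tile sizes here are illustrative:

```python
def tile_feature_map(h, w, tile_h, tile_w):
    """Yield (row0, row1, col0, col1) bounds for each spatial tile.
    Channels are untouched: every tile carries the full depth."""
    tiles = []
    for r in range(0, h, tile_h):
        for c in range(0, w, tile_w):
            tiles.append((r, min(r + tile_h, h), c, min(c + tile_w, w)))
    return tiles

# A 224x224 map with 112x112 on-chip tiles → the slide's four tiles:
tiles = tile_feature_map(224, 224, 112, 112)   # → 4 tiles
```

Convolution tiles in practice also need halo overlap at tile borders so the kernel sees its full receptive field; that detail is omitted here.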

Page 13: xfDNN Runtime Engine

˃ ML-specific engine built on top of the SDx runtime
˃ Lightweight and portable, with no dependency on ML frameworks
˃ Extensive C++/Python API with a simplified use model
˃ Asynchronous xDNN execution
˃ Streaming/pipelined/dataflow support for large image sets
˃ Multiple CNN models running on a single FPGA
˃ Multiple-FPGA support

Page 14: Simple Usage

[Figure: code examples — a blocking call, and a non-blocking call driving 8 FPGAs]
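The code examples on this slide were lost in extraction. A hedged sketch of what the blocking vs non-blocking pattern typically looks like — the names here (`XdnnRunner`, `execute`, `execute_async`) are hypothetical stand-ins, not the actual ml-suite API, and the "inference" is a placeholder:

```python
from concurrent.futures import ThreadPoolExecutor

class XdnnRunner:                          # hypothetical stand-in class
    def __init__(self, num_fpgas=1):
        self.pool = ThreadPoolExecutor(max_workers=num_fpgas)

    def execute(self, image):              # blocking: returns the result
        return self._infer(image)

    def execute_async(self, image):        # non-blocking: returns a future
        return self.pool.submit(self._infer, image)

    def _infer(self, image):
        return sum(image)                  # placeholder for FPGA inference

runner = XdnnRunner(num_fpgas=8)           # e.g. spread work over 8 FPGAs
out = runner.execute([1, 2, 3])            # blocking call
fut = runner.execute_async([4, 5, 6])      # overlaps with other host work
result = fut.result()                      # collect when needed
```

The non-blocking form is what lets one host thread keep all eight devices busy.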

Page 15

Server Platforms: Intel x86, AMD Epyc, Power9, ARM
FaaS
Xilinx Boards: Alveo U200, Alveo U250

https://github.com/Xilinx/ml-suite

* Under development

Page 16: xDNN v3 Implementation on Alveo U200

˃ 3 large 96x16 PEs – 1 in each SLR – 5.2 ML Shell
˃ Kernels @ 720 MHz / 360 MHz

Resource  Count  Utilization
LUTs      658k   52%
DSPs      5661   80%
BRAM      1258   58%
URAM      864    92%

Page 17: xDNN v3 Implementation on Alveo U250

˃ 4 large 96x16 PEs – 1 in each SLR – standard 5.2 Shell
˃ Kernels @ 700 MHz / 350 MHz

Resource  Count  Utilization
LUTs      876k   51%
DSPs      7548   62%
BRAM      1632   61%
URAM      1152   90%

Page 18: xDNN GoogLeNet v1 Performance – Image Size 224x224

Configuration                      Images/s  Latency (ms)
Alveo U200 Latency Mode (INT8)     2,542     1.18
Alveo U200 Throughput Mode (INT8)  3,124     1.87
Alveo U250 Latency Mode (INT8)     3,389     1.18
Alveo U250 Throughput Mode (INT8)  4,127     1.82

Page 19: xDNN GoogLeNet v1 Energy Efficiency

Configuration                      Images/s/W
Alveo U200 Throughput Mode (INT8)  35.5
Alveo U250 Throughput Mode (INT8)  37.5
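Cross-checking the two GoogLeNet slides: dividing throughput-mode images/s by images/s/W gives the implied board power. This is our derived arithmetic, not a number stated in the deck:

```python
# Throughput-mode figures from the performance and efficiency slides.
u200_power = 3124 / 35.5   # implied U200 power, ≈ 88 W
u250_power = 4127 / 37.5   # implied U250 power, ≈ 110 W
```

So the larger U250 draws more power but still comes out ahead on images/s/W.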

Page 20: xDNN YOLO v2 Performance – Image Size 608x608

Configuration                    Images/s  Latency (ms)
Alveo U200 Latency Mode (INT8)   88        34
Alveo U250 Latency Mode (INT8)   117       34

Page 21: Custom Deep Learning Pipeline

[Figure: example pipeline — Video Decode + Processing → four xDNN instances → Video Processing + Encode]

• Video + ML
• Genomics + ML
• Risk Modelling + ML
• Database + ML
• Network IPS + ML
• Storage + ML

Integrate custom applications with xDNN to lower end-to-end latency.

Page 22: Additional Information

˃ Visit the Alveo Demo Room
˃ Refer to White Paper WP504, “Accelerating DNNs with Xilinx Alveo Accelerator Cards”
˃ https://www.xilinx.com/applications/megatrends/machine-learning.html

Page 23

Adaptable. Intelligent.