Transcript of “Accelerating AI in Datacenters: Xilinx ML Suite” (2020. 9. 4.)
© Copyright 2018 Xilinx
Accelerating AI in DatacentersXilinx ML Suite
Kamran Khan
Sr. Product Manager, AI and Machine Learning
Deep Learning Models – A Broad Spectrum

Convolutional Neural Network
• Feature Extraction
• Object Detection
• Image Segmentation

Recurrent Neural Network
• Sequence and Temporal Data
• Speech to Text
• Language Translation

Multi-Layer Perceptron
• Classification
• Universal Function Approximator
• Autoencoder

[Figure: example images for classification (“Dog”), object detection, and segmentation]
Xilinx ML Suite – DNN Processor + “Compiler, Quantizer, Runtime”

[Figure: xDNN block diagram – Execution Controller, Spill/Restore DMA Controller, Weights DMA Controller, Instruction Buffer, Image Queue, Cross Bar, an array of multiply-accumulate units with distributed weight buffers, Bias/ReLU units, and Pooling/EWA units]

Trained Model → Compiler + Runtime → Xilinx DNN Processor
• 60–80% efficiency
• Low latency, high throughput at batch 1
• No FPGA expertise required
Xilinx DNN Processor (xDNN)

˃ Configurable overlay processor
˃ DNN-specific instruction set: convolution, max pool, etc.
˃ Any network, any image size
˃ High frequency and high compute efficiency
˃ Compile and run new networks

[Figure: xDNN block diagram – systolic array at the core, fed by the Weights DMA Controller, Image Queue, and Instruction Buffer; the Execution Controller and Spill/Restore DMA Controller manage sequencing, with Bias/ReLU, Pooling, Cross Bar, and Pooling/EWA stages on the output path]
Rapid Feature and Performance Improvement

xDNN-v1 (Q4 CY17)
• Array of accumulators
• INT16 (batch = 1) and INT8 (batch = 2) support
• Instructions: convolution, ReLU, pool, element-wise
• Flexible (square) kernel sizes and strides
• 500 MHz

xDNN-v2 (Q2 CY18)
• All xDNN-v1 features
• DDR caching: larger image sizes
• New instructions: depth-wise convolution, deconvolution, up-sampling
• Rectangular kernels
• 500 MHz

xDNN-v3 (Q4 CY18)
• New systolic array implementation: 2.2x lower latency
• Instruction-level parallelism – non-blocking data movement
• Batch = 1 for INT8 – lower latency
• Feature-compatible with xDNN-v2
• 720+ MHz
xDNN – Channel Parallel Systolic Array

[Figure: 2D channel-parallel datapath with distributed weight buffers – a grid of multiply-accumulate units, each holding local weights, feeding column accumulators (∑)]

Micro-architecture optimized for the underlying UltraScale+ FPGA fabric
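The channel-parallel idea above can be sketched in NumPy: each column of the array holds the weights for one output channel in its local buffer, input channels stream through, and every column accumulates its own partial sum in parallel. This is an illustrative toy model of the dataflow, not the actual RTL.

```python
import numpy as np

def channel_parallel_mac(pixels, weights):
    """Toy model of a channel-parallel MAC array.

    pixels:  (in_ch,) input activations for one pixel position
    weights: (in_ch, out_ch) distributed weight buffers, one column per PE column

    Each step streams one input channel through the array; every column
    multiplies it by its locally stored weight and adds to its accumulator,
    so each column produces one output channel.
    """
    in_ch, out_ch = weights.shape
    acc = np.zeros(out_ch)
    for c in range(in_ch):             # channels stream through the array
        acc += pixels[c] * weights[c]  # all columns MAC in parallel
    return acc

x = np.array([1.0, 2.0, 3.0])
w = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
print(channel_parallel_mac(x, w))  # equivalent to x @ w -> [4. 5.]
```

The result is just a matrix-vector product; the point of the hardware arrangement is that the weights stay resident in the distributed buffers while activations stream past them.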
xDNN Processor – Tensor Memory

[Figure: channel-parallel tensor memory with concurrent access – write ports from the systolic array, the DDR DMA, and the parallel processing units; read ports to the systolic array, the DDR DMA, and the parallel processing units]
xDNN Compiler + Runtime

https://github.com/Xilinx/ml-suite

[Figure: tool flow – deep learning frameworks (e.g. MxNet) feed a frontend that converts the framework tensor graph to a Xilinx tensor graph; the compiler applies tensor graph optimization, the quantizer consumes the model weights and a calibration set, and the runtime dispatches CPU layers and FPGA layers for each image]
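As a rough illustration of what a calibration-set-based quantizer does, the sketch below picks a per-tensor INT8 scale with a simple max-abs rule. This is a generic sketch of the technique, not the ml-suite quantizer's actual algorithm.

```python
import numpy as np

def calibrate_scale(calibration_batches):
    """Pick a per-tensor INT8 scale from a calibration set (max-abs method).

    Generic illustration: run representative inputs, observe the dynamic
    range, and map it onto the signed 8-bit range.
    """
    max_abs = max(np.abs(b).max() for b in calibration_batches)
    return max_abs / 127.0  # one quantization step

def quantize(x, scale):
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
calib = [rng.standard_normal((8, 8)).astype(np.float32) for _ in range(4)]
scale = calibrate_scale(calib)

x = calib[0]
err = np.abs(dequantize(quantize(x, scale), scale) - x).max()
print(f"scale={scale:.4f}, max round-trip error={err:.4f}")
```

Because the scale covers the full observed range, the round-trip error for any calibration input stays within one quantization step; production quantizers typically refine this with percentile or entropy-based range selection.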
Graph Partitioning

˃ One-time loading of sub-graph instructions
˃ Data-flow execution

[Figure: pipeline of pre-processing, subgraph 1, parallel subgraphs, and post-processing stages, each mapped to FPGA or CPU]

Scales from 1 core to multi-core to multi-chip
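The partitioning step can be sketched as grouping consecutive operations by the device that supports them, so each sub-graph's instructions are loaded once and dispatched as a unit. The op names and the supported-op set below are illustrative stand-ins, not ml-suite's actual tables.

```python
# Ops the accelerator can execute; everything else falls back to the CPU.
# This set is a hypothetical example, not the real xDNN instruction list.
FPGA_OPS = {"conv", "relu", "pool", "eltwise"}

def partition(ops):
    """Split a linear op sequence into maximal same-device sub-graphs."""
    subgraphs = []
    for op in ops:
        device = "FPGA" if op in FPGA_OPS else "CPU"
        if subgraphs and subgraphs[-1][0] == device:
            subgraphs[-1][1].append(op)   # extend the current sub-graph
        else:
            subgraphs.append((device, [op]))  # start a new sub-graph
    return subgraphs

net = ["preprocess", "conv", "relu", "pool", "conv", "relu", "softmax"]
print(partition(net))
# [('CPU', ['preprocess']),
#  ('FPGA', ['conv', 'relu', 'pool', 'conv', 'relu']),
#  ('CPU', ['softmax'])]
```

A real partitioner works on a graph rather than a list, which is also what enables the parallel sub-graphs shown in the figure.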
Fusing of Operations

˃ Fuse operations as pipelined stages
˃ Access activations only once

[Figure: Conv, Bias, Batch Norm., ReLU, and Max Pool fused into pipelined operators (in-line, no RAM access)]
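The payoff of fusion can be shown algebraically: bias, batch norm, and ReLU collapse into one scale, one shift, and one max, so each activation is read and written exactly once instead of once per stage. This is a NumPy sketch of the idea, not the hardware pipeline itself.

```python
import numpy as np

def fused_bias_bn_relu(x, bias, gamma, beta, mean, var, eps=1e-5):
    """Apply bias + batch norm + ReLU in a single pass over the activations.

    gamma*((x + bias) - mean)/sqrt(var + eps) + beta folds into
    x*scale + shift, so the fused stage needs one read and one write
    per activation, with ReLU applied in-line.
    """
    inv_std = gamma / np.sqrt(var + eps)
    scale = inv_std                          # folded multiplier
    shift = beta + inv_std * (bias - mean)   # folded offset
    return np.maximum(x * scale + shift, 0.0)

x = np.random.randn(4, 8).astype(np.float32)
bias, gamma, beta, mean, var = 0.1, 1.5, -0.2, 0.0, 1.0

# Unfused reference: three separate elementwise passes.
ref = np.maximum((x + bias - mean) / np.sqrt(var + 1e-5) * gamma + beta, 0.0)
print(np.allclose(fused_bias_bn_relu(x, bias, gamma, beta, mean, var), ref))
```

On the FPGA the same stages run as a hardware pipeline rather than a folded formula, but the memory-traffic argument is identical: intermediate activations never touch RAM.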
Instruction Level Parallelism

[Figure: an Inception-style block – the previous layer output feeds four branches (1x1 conv; 3x3 conv reduce → 3x3 conv; 5x5 conv reduce → 5x5 conv; 3x3 max pool → 1x1 conv), whose results form the concatenated output; the schedule executes the independent branches in parallel over time]
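The scheduling idea can be mimicked in software: the branches of such a block share no data, so they may execute concurrently and only the final concatenation synchronizes. The branch functions below are trivial stand-ins for the convolutions, purely to show the dependency structure.

```python
from concurrent.futures import ThreadPoolExecutor

def run_block(x, branches):
    """Run independent branches of a block concurrently.

    Because no branch reads another branch's output, a scheduler with
    non-blocking data movement can overlap them; only the join at the
    end waits for everything.
    """
    with ThreadPoolExecutor() as pool:
        outputs = list(pool.map(lambda fn: fn(x), branches))
    return outputs  # a real block would concatenate along the channel axis

branches = [
    lambda x: x + 1,   # stands in for the 1x1 conv branch
    lambda x: x * 2,   # stands in for 3x3 reduce -> 3x3 conv
    lambda x: x - 3,   # stands in for 5x5 reduce -> 5x5 conv
]
print(run_block(10, branches))  # [11, 20, 7]
```

In xDNN the same effect comes from instruction-level parallelism in the execution controller rather than threads, but the dependency analysis is the same.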
Automatic Intra-Layer Tiling

˃ Tile when the feature map size exceeds on-chip memory
˃ Work on the full feature-map depth

[Figure: a feature map of any size split into tiles 0–3]
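Spatial tiling of this kind is straightforward to sketch: split the height and width into tiles that fit on chip, clipping the edge tiles, while each tile keeps the full channel depth. Tile sizes here are illustrative.

```python
def tile_feature_map(height, width, tile_h, tile_w):
    """Split an H x W feature map into spatial tiles.

    Returns (row, col, tile_height, tile_width) for each tile. Tiling is
    spatial only, so every tile still carries the full feature-map depth.
    """
    tiles = []
    for r in range(0, height, tile_h):
        for c in range(0, width, tile_w):
            tiles.append((r, c,
                          min(tile_h, height - r),   # clip bottom edge
                          min(tile_w, width - c)))   # clip right edge
    return tiles

# A 500x300 map with 256x256 tiles -> four tiles, edge tiles clipped.
print(tile_feature_map(500, 300, 256, 256))
# [(0, 0, 256, 256), (0, 256, 256, 44), (256, 0, 244, 256), (256, 256, 244, 44)]
```

A real tiler also adds halo regions so convolutions at tile borders see their full receptive field; that bookkeeping is omitted here.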
xfDNN Runtime Engine

˃ ML-specific engine built on top of the SDx runtime
˃ Lightweight and portable, with no dependency on ML frameworks
˃ Extensive C++/Python API with a simplified use model
˃ Asynchronous xDNN execution
˃ Streaming/pipeline/dataflow support for large image sets
˃ Multiple CNN models running on a single FPGA
˃ Multiple-FPGA support
Simple Usage

[Figure: code examples – a blocking call, and non-blocking dispatch across 8 FPGAs]
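The blocking vs. non-blocking usage pattern looks roughly like the sketch below. `XdnnRunner` and its `infer` method are hypothetical stand-ins for a runtime handle, not the actual ml-suite Python API; only the dispatch pattern is the point.

```python
from concurrent.futures import ThreadPoolExecutor

class XdnnRunner:
    """Hypothetical stand-in for an accelerator runtime handle."""
    def __init__(self, device_id):
        self.device_id = device_id

    def infer(self, image):
        # Placeholder for a real inference call on one FPGA.
        return f"result({image})@fpga{self.device_id}"

# Blocking: submit one image and wait for its result.
runner = XdnnRunner(0)
print(runner.infer("img0"))

# Non-blocking: keep 8 FPGAs busy by dispatching asynchronously and
# collecting results as they complete.
runners = [XdnnRunner(i) for i in range(8)]
images = [f"img{i}" for i in range(32)]
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(runners[i % 8].infer, img)
               for i, img in enumerate(images)]
    results = [f.result() for f in futures]
print(len(results))  # 32
```

The real runtime exposes asynchronous execution natively, so the thread pool here only emulates the overlap that non-blocking calls provide.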
Server platforms: Intel x86, AMD EPYC, POWER9, Arm
FaaS and Xilinx boards: Alveo U200, Alveo U250
https://github.com/Xilinx/ml-suite
*Under development
xDNN v3 Implementation on Alveo U200

˃ 3 large 96x16 PEs – 1 in each SLR – 5.2 ML Shell
˃ Kernels at 720 MHz / 360 MHz

Resource | Count | Utilization
LUTs     | 658k  | 52%
DSPs     | 5,661 | 80%
BRAM     | 1,258 | 58%
URAM     | 864   | 92%
xDNN v3 Implementation on Alveo U250

˃ 4 large 96x16 PEs – 1 in each SLR – standard 5.2 Shell
˃ Kernels at 700 MHz / 350 MHz

Resource | Count | Utilization
LUTs     | 876k  | 51%
DSPs     | 7,548 | 62%
BRAM     | 1,632 | 61%
URAM     | 1,152 | 90%
xDNN GoogLeNet v1 Performance – Image Size 224x224

Configuration                     | Images/s | Latency (ms)
Alveo U200 Latency Mode (INT8)    | 2,542    | 1.18
Alveo U200 Throughput Mode (INT8) | 3,124    | 1.87
Alveo U250 Latency Mode (INT8)    | 3,389    | 1.18
Alveo U250 Throughput Mode (INT8) | 4,127    | 1.82
xDNN GoogLeNet v1 Energy Efficiency

Configuration                     | Images/s/W
Alveo U200 Throughput Mode (INT8) | 35.5
Alveo U250 Throughput Mode (INT8) | 37.5
xDNN YOLO v2 Performance – Image Size 608x608

Configuration                  | Images/s | Latency (ms)
Alveo U200 Latency Mode (INT8) | 88       | 34
Alveo U250 Latency Mode (INT8) | 117      | 34
Custom Deep Learning Pipeline

[Figure: video decode + processing feeding four xDNN instances, followed by video processing + encode]

• Video + ML
• Genomics + ML
• Risk Modelling + ML
• Database + ML
• Network IPS + ML
• Storage + ML

Integrate custom applications with xDNN to lower end-to-end latency.
Additional Information

˃ Visit the Alveo Demo Room
˃ Refer to White Paper WP504, “Accelerating DNNs with Xilinx Alveo Accelerator Cards”
˃ https://www.xilinx.com/applications/megatrends/machine-learning.html
Adaptable.
Intelligent.