Transcript of “Accelerating AI in Datacenters: Xilinx ML Suite” (2020. 9. 4.)
© Copyright 2018 Xilinx
Accelerating AI in DatacentersXilinx ML Suite
Kamran Khan
Sr. Product Manager, AI and Machine Learning
Deep Learning Models – A Broad Spectrum

Convolutional Neural Network
• Feature Extraction
• Object Detection
• Image Segmentation

Recurrent Neural Network
• Sequence and Temporal Data
• Speech to Text
• Language Translation

Multi-Layer Perceptron
• Classification
• Universal Function Approximator
• Autoencoder

[Figure: example images for classification (“Dog”), object detection, and segmentation]
Xilinx ML Suite – DNN Processor + “Compiler, Quantizer, Runtime”

[Figure: xDNN block diagram – Execution Controller, Spill/Restore DMA Controller, Weights DMA Controller, Instruction Buffer, Image Queue, Cross Bar, an array of multiply-accumulate units with distributed weight buffers, Bias/ReLU units, and Pooling/EWA units]

Trained Model → Compiler + Runtime → Xilinx DNN Processor
• 60–80% efficiency
• Low latency, high throughput at batch 1
• No FPGA expertise required
Xilinx DNN Processor (xDNN)

˃ Configurable overlay processor
˃ DNN-specific instruction set: convolution, max pool, etc.
˃ Any network, any image size
˃ High frequency and high compute efficiency
˃ Compile and run new networks

[Figure: xDNN block diagram – systolic array at the core, fed by the Weights DMA Controller, Image Queue, and Instruction Buffer; the Execution Controller and Spill/Restore DMA Controller manage sequencing, with Bias/ReLU, Pooling, Cross Bar, and Pooling/EWA stages on the output path]
Rapid Feature and Performance Improvement

xDNN-v1 (Q4 CY17)
• Array of accumulators
• INT16 (batch = 1) and INT8 (batch = 2) support
• Instructions: convolution, ReLU, pool, element-wise
• Flexible (square) kernel sizes and strides
• 500 MHz

xDNN-v2 (Q2 CY18)
• All xDNN-v1 features
• DDR caching: larger image sizes
• New instructions: depth-wise convolution, deconvolution, up-sampling
• Rectangular kernels
• 500 MHz

xDNN-v3 (Q4 CY18)
• New systolic array implementation: 2.2x lower latency
• Instruction-level parallelism – non-blocking data movement
• Batch = 1 for INT8 – lower latency
• Feature-compatible with xDNN-v2
• 720+ MHz
xDNN – Channel Parallel Systolic Array

[Figure: 2D channel-parallel datapath with distributed weight buffers – a grid of multiply-accumulate units, each holding local weights, feeding column accumulators (∑)]

Micro-architecture optimized for the underlying UltraScale+ FPGA fabric
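The channel-parallel idea above can be sketched in NumPy: each column of the array holds the weights for one output channel in its local buffer, input channels stream through, and every column accumulates its own partial sum in parallel. This is an illustrative toy model of the dataflow, not the actual RTL.

```python
import numpy as np

def channel_parallel_mac(pixels, weights):
    """Toy model of a channel-parallel MAC array.

    pixels:  (in_ch,) input activations for one pixel position
    weights: (in_ch, out_ch) distributed weight buffers, one column per PE column

    Each step streams one input channel through the array; every column
    multiplies it by its locally stored weight and adds to its accumulator,
    so each column produces one output channel.
    """
    in_ch, out_ch = weights.shape
    acc = np.zeros(out_ch)
    for c in range(in_ch):             # channels stream through the array
        acc += pixels[c] * weights[c]  # all columns MAC in parallel
    return acc

x = np.array([1.0, 2.0, 3.0])
w = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
print(channel_parallel_mac(x, w))  # equivalent to x @ w -> [4. 5.]
```

The result is just a matrix-vector product; the point of the hardware arrangement is that the weights stay resident in the distributed buffers while activations stream past them.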
xDNN Processor – Tensor Memory

[Figure: channel-parallel tensor memory with concurrent access – write ports from the systolic array, the DDR DMA, and the parallel processing units; read ports to the systolic array, the DDR DMA, and the parallel processing units]
xDNN Compiler + Runtime

https://github.com/Xilinx/ml-suite

[Figure: tool flow – deep learning frameworks (e.g. MxNet) feed a frontend that converts the framework tensor graph to a Xilinx tensor graph; the compiler applies tensor graph optimization, the quantizer consumes the model weights and a calibration set, and the runtime dispatches CPU layers and FPGA layers for each image]
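As a rough illustration of what a calibration-set-based quantizer does, the sketch below picks a per-tensor INT8 scale with a simple max-abs rule. This is a generic sketch of the technique, not the ml-suite quantizer's actual algorithm.

```python
import numpy as np

def calibrate_scale(calibration_batches):
    """Pick a per-tensor INT8 scale from a calibration set (max-abs method).

    Generic illustration: run representative inputs, observe the dynamic
    range, and map it onto the signed 8-bit range.
    """
    max_abs = max(np.abs(b).max() for b in calibration_batches)
    return max_abs / 127.0  # one quantization step

def quantize(x, scale):
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
calib = [rng.standard_normal((8, 8)).astype(np.float32) for _ in range(4)]
scale = calibrate_scale(calib)

x = calib[0]
err = np.abs(dequantize(quantize(x, scale), scale) - x).max()
print(f"scale={scale:.4f}, max round-trip error={err:.4f}")
```

Because the scale covers the full observed range, the round-trip error for any calibration input stays within one quantization step; production quantizers typically refine this with percentile or entropy-based range selection.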
Graph Partitioning

˃ One-time loading of sub-graph instructions
˃ Data-flow execution

[Figure: pipeline of pre-processing, subgraph 1, parallel subgraphs, and post-processing stages, each mapped to FPGA or CPU]

Scales from 1 core to multi-core to multi-chip
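The partitioning step can be sketched as grouping consecutive operations by the device that supports them, so each sub-graph's instructions are loaded once and dispatched as a unit. The op names and the supported-op set below are illustrative stand-ins, not ml-suite's actual tables.

```python
# Ops the accelerator can execute; everything else falls back to the CPU.
# This set is a hypothetical example, not the real xDNN instruction list.
FPGA_OPS = {"conv", "relu", "pool", "eltwise"}

def partition(ops):
    """Split a linear op sequence into maximal same-device sub-graphs."""
    subgraphs = []
    for op in ops:
        device = "FPGA" if op in FPGA_OPS else "CPU"
        if subgraphs and subgraphs[-1][0] == device:
            subgraphs[-1][1].append(op)   # extend the current sub-graph
        else:
            subgraphs.append((device, [op]))  # start a new sub-graph
    return subgraphs

net = ["preprocess", "conv", "relu", "pool", "conv", "relu", "softmax"]
print(partition(net))
# [('CPU', ['preprocess']),
#  ('FPGA', ['conv', 'relu', 'pool', 'conv', 'relu']),
#  ('CPU', ['softmax'])]
```

A real partitioner works on a graph rather than a list, which is also what enables the parallel sub-graphs shown in the figure.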
Fusing of Operations

˃ Fuse operations as pipelined stages
˃ Access activations only once

[Figure: Conv, Bias, Batch Norm., ReLU, and Max Pool fused into pipelined operators (in-line, no RAM access)]
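The payoff of fusion can be shown algebraically: bias, batch norm, and ReLU collapse into one scale, one shift, and one max, so each activation is read and written exactly once instead of once per stage. This is a NumPy sketch of the idea, not the hardware pipeline itself.

```python
import numpy as np

def fused_bias_bn_relu(x, bias, gamma, beta, mean, var, eps=1e-5):
    """Apply bias + batch norm + ReLU in a single pass over the activations.

    gamma*((x + bias) - mean)/sqrt(var + eps) + beta folds into
    x*scale + shift, so the fused stage needs one read and one write
    per activation, with ReLU applied in-line.
    """
    inv_std = gamma / np.sqrt(var + eps)
    scale = inv_std                          # folded multiplier
    shift = beta + inv_std * (bias - mean)   # folded offset
    return np.maximum(x * scale + shift, 0.0)

x = np.random.randn(4, 8).astype(np.float32)
bias, gamma, beta, mean, var = 0.1, 1.5, -0.2, 0.0, 1.0

# Unfused reference: three separate elementwise passes.
ref = np.maximum((x + bias - mean) / np.sqrt(var + 1e-5) * gamma + beta, 0.0)
print(np.allclose(fused_bias_bn_relu(x, bias, gamma, beta, mean, var), ref))
```

On the FPGA the same stages run as a hardware pipeline rather than a folded formula, but the memory-traffic argument is identical: intermediate activations never touch RAM.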
Instruction Level Parallelism

[Figure: an Inception-style block – the previous layer output feeds four branches (1x1 conv; 3x3 conv reduce → 3x3 conv; 5x5 conv reduce → 5x5 conv; 3x3 max pool → 1x1 conv), whose results form the concatenated output; the schedule executes the independent branches in parallel over time]
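The scheduling idea can be mimicked in software: the branches of such a block share no data, so they may execute concurrently and only the final concatenation synchronizes. The branch functions below are trivial stand-ins for the convolutions, purely to show the dependency structure.

```python
from concurrent.futures import ThreadPoolExecutor

def run_block(x, branches):
    """Run independent branches of a block concurrently.

    Because no branch reads another branch's output, a scheduler with
    non-blocking data movement can overlap them; only the join at the
    end waits for everything.
    """
    with ThreadPoolExecutor() as pool:
        outputs = list(pool.map(lambda fn: fn(x), branches))
    return outputs  # a real block would concatenate along the channel axis

branches = [
    lambda x: x + 1,   # stands in for the 1x1 conv branch
    lambda x: x * 2,   # stands in for 3x3 reduce -> 3x3 conv
    lambda x: x - 3,   # stands in for 5x5 reduce -> 5x5 conv
]
print(run_block(10, branches))  # [11, 20, 7]
```

In xDNN the same effect comes from instruction-level parallelism in the execution controller rather than threads, but the dependency analysis is the same.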
Automatic Intra-Layer Tiling

˃ Tile when the feature map size exceeds on-chip memory
˃ Work on the full feature-map depth

[Figure: a feature map of any size split into tiles 0–3]
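Spatial tiling of this kind is straightforward to sketch: split the height and width into tiles that fit on chip, clipping the edge tiles, while each tile keeps the full channel depth. Tile sizes here are illustrative.

```python
def tile_feature_map(height, width, tile_h, tile_w):
    """Split an H x W feature map into spatial tiles.

    Returns (row, col, tile_height, tile_width) for each tile. Tiling is
    spatial only, so every tile still carries the full feature-map depth.
    """
    tiles = []
    for r in range(0, height, tile_h):
        for c in range(0, width, tile_w):
            tiles.append((r, c,
                          min(tile_h, height - r),   # clip bottom edge
                          min(tile_w, width - c)))   # clip right edge
    return tiles

# A 500x300 map with 256x256 tiles -> four tiles, edge tiles clipped.
print(tile_feature_map(500, 300, 256, 256))
# [(0, 0, 256, 256), (0, 256, 256, 44), (256, 0, 244, 256), (256, 256, 244, 44)]
```

A real tiler also adds halo regions so convolutions at tile borders see their full receptive field; that bookkeeping is omitted here.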
xfDNN Runtime Engine

˃ ML-specific engine built on top of the SDx runtime
˃ Lightweight and portable, with no dependency on ML frameworks
˃ Extensive C++/Python API with a simplified use model
˃ Asynchronous xDNN execution
˃ Streaming/pipeline/dataflow support for large image sets
˃ Multiple CNN models running on a single FPGA
˃ Multiple-FPGA support
Simple Usage

[Figure: code examples – a blocking call, and non-blocking dispatch across 8 FPGAs]
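The blocking vs. non-blocking usage pattern looks roughly like the sketch below. `XdnnRunner` and its `infer` method are hypothetical stand-ins for a runtime handle, not the actual ml-suite Python API; only the dispatch pattern is the point.

```python
from concurrent.futures import ThreadPoolExecutor

class XdnnRunner:
    """Hypothetical stand-in for an accelerator runtime handle."""
    def __init__(self, device_id):
        self.device_id = device_id

    def infer(self, image):
        # Placeholder for a real inference call on one FPGA.
        return f"result({image})@fpga{self.device_id}"

# Blocking: submit one image and wait for its result.
runner = XdnnRunner(0)
print(runner.infer("img0"))

# Non-blocking: keep 8 FPGAs busy by dispatching asynchronously and
# collecting results as they complete.
runners = [XdnnRunner(i) for i in range(8)]
images = [f"img{i}" for i in range(32)]
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(runners[i % 8].infer, img)
               for i, img in enumerate(images)]
    results = [f.result() for f in futures]
print(len(results))  # 32
```

The real runtime exposes asynchronous execution natively, so the thread pool here only emulates the overlap that non-blocking calls provide.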
Server platforms: Intel x86, AMD EPYC, POWER9, Arm
FaaS and Xilinx boards: Alveo U200, Alveo U250
https://github.com/Xilinx/ml-suite
*Under development
xDNN v3 Implementation on Alveo U200

˃ 3 large 96x16 PEs – 1 in each SLR – 5.2 ML Shell
˃ Kernels at 720 MHz / 360 MHz

Resource | Count | Utilization
LUTs     | 658k  | 52%
DSPs     | 5,661 | 80%
BRAM     | 1,258 | 58%
URAM     | 864   | 92%
xDNN v3 Implementation on Alveo U250

˃ 4 large 96x16 PEs – 1 in each SLR – standard 5.2 Shell
˃ Kernels at 700 MHz / 350 MHz

Resource | Count | Utilization
LUTs     | 876k  | 51%
DSPs     | 7,548 | 62%
BRAM     | 1,632 | 61%
URAM     | 1,152 | 90%
xDNN GoogLeNet v1 Performance – Image Size 224x224

Configuration                     | Images/s | Latency (ms)
Alveo U200 Latency Mode (INT8)    | 2,542    | 1.18
Alveo U200 Throughput Mode (INT8) | 3,124    | 1.87
Alveo U250 Latency Mode (INT8)    | 3,389    | 1.18
Alveo U250 Throughput Mode (INT8) | 4,127    | 1.82
xDNN GoogLeNet v1 Energy Efficiency

Configuration                     | Images/s/W
Alveo U200 Throughput Mode (INT8) | 35.5
Alveo U250 Throughput Mode (INT8) | 37.5
xDNN YOLO v2 Performance – Image Size 608x608

Configuration                  | Images/s | Latency (ms)
Alveo U200 Latency Mode (INT8) | 88       | 34
Alveo U250 Latency Mode (INT8) | 117      | 34
Custom Deep Learning Pipeline

[Figure: video decode + processing feeding four xDNN instances, followed by video processing + encode]

• Video + ML
• Genomics + ML
• Risk Modelling + ML
• Database + ML
• Network IPS + ML
• Storage + ML

Integrate custom applications with xDNN to lower end-to-end latency.
Additional Information

˃ Visit the Alveo Demo Room
˃ Refer to White Paper WP504, “Accelerating DNNs with Xilinx Alveo Accelerator Cards”
˃ https://www.xilinx.com/applications/megatrends/machine-learning.html
Adaptable.
Intelligent.