A New Era of Hardware Microservices in the Cloud
Doug Burger
Distinguished Engineer, Microsoft
UW Cloud Workshop
March 31, 2017
Moore’s Law
• Dennard Scaling has been dead for a decade
• Moore's Law is over
  • Remember, it's a rate
• Silicon scaling continues
  • But lots of weirdnesses
Specialization (?)
Hardware Microservices
[Diagram: a cloud (or client) service calling hardware microservice (HuS) instances (HuS A instances 1 and 2, HuS B instances 1 and 2, HuS C instance 1), each exposed as an IP + API at microsecond latencies]
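As a rough illustration of the model (a sketch only: the address, port, and message format below are hypothetical, not Microsoft's actual interface), a client would reach a HuS instance with a single request/response over the network:

```python
import socket

# Hypothetical address for "HuS A instance 1"; real instances would be
# discovered through the service's own directory, not hard-coded.
HUS_A_INSTANCE_1 = ("10.0.0.17", 9000)

def call_hus(request: bytes, timeout_s: float = 0.001) -> bytes:
    """One UDP round trip to a hardware microservice instance."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_s)
        sock.sendto(request, HUS_A_INSTANCE_1)
        reply, _addr = sock.recvfrom(65535)
        return reply
```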
Catapult V2 Architecture
[Diagram: WCS 2.0 server blade with Catapult V2: two CPUs with DRAM linked by QPI, the FPGA with its own DRAM attached over PCIe Gen3 (2x8 and x8 links), and QSFP ports placing the FPGA between the NIC and the switch with two 40Gb/s links]
WCS Gen4.1 Blade with Mellanox NIC and Catapult FPGA
[Photos: Pikes Peak; WCS tray backplane; option card mezzanine connectors; the Catapult v2 mezzanine card]
• The architecture justifies the economics
1. Can act as a local compute accelerator
2. Can act as a network/storage accelerator
3. Can act as a remote distributed computing fabric
Configurable Cloud
[Diagram: a CPU compute layer and a reconfigurable compute layer joined by a converged network]
[Block diagram: the HuS shell around the user-logic Role: slot DMA engine, two x8 PCIe cores (HIP 0, HIP 1) to the host CPU, DDR3 controller driving 4 GB of 72-bit DDR3, config flash (4x 256Mb QSPI, RSU), JTAG, I2C, temperature and SEU monitoring, clocking, and two 40G MACs (to the server NIC and the top-of-rack switch) with FIFOs, 40G bypass control, and ASLs]
[Diagram: the user-logic Role connects through an Elastic Router and the Lightweight Transport Layer (LTL) to the network switch]
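As a concrete (and purely illustrative) reading of that layering, an LTL-style message would carry just enough header for connections, reliable delivery, and flow control; the field names and widths here are assumptions, since the talk does not give the wire format:

```python
import struct

# Hypothetical LTL-style header: connection id, sequence number (for
# reliable delivery), and a credit field (for flow control). The actual
# LTL wire format is not given in the talk; these fields are illustrative.
LTL_HEADER = struct.Struct("!HIH")  # conn_id:16, seq:32, credits:16

def pack_ltl(conn_id: int, seq: int, credits: int, payload: bytes) -> bytes:
    return LTL_HEADER.pack(conn_id, seq, credits) + payload

def unpack_ltl(frame: bytes):
    conn_id, seq, credits = LTL_HEADER.unpack_from(frame)
    return conn_id, seq, credits, frame[LTL_HEADER.size:]
```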
LTL Network reach and latencies
[Plot: round-trip latency (µs, 0-25) vs. number of reachable hosts/FPGAs (log scale, 1 to 1,000,000), showing LTL average and 99.9th-percentile latency for L0 (same TOR), L1, and L2, with reach marks at 10K, 100K, and 250K; insets show an example L0 latency histogram, an example L1 latency histogram, and example L2 latency histograms for different pairs of FPGAs]
Scaling the HuS fabric
[Diagram: the HuS fabric spanning multiple TORs through L1 and L2 switch tiers, hosting SQL, deep neural networks, web search ranking, and GFT offload]
Use case: shared service
• Most accelerators have more throughput than a single host requires
• Share excess capacity and use fewer instances
• Frees up FPGAs for other services
• Sustains 3.6x clients / FPGA
• 73% of FPGAs remain available (a quick check below)
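A back-of-the-envelope check of those two numbers (assuming client load spreads evenly across the shared FPGAs):

```python
# If each FPGA sustains 3.6 hosts' worth of client load, only 1/3.6 of
# the fleet's FPGAs are needed to serve everyone.
clients_per_fpga = 3.6
fraction_in_use = 1 / clients_per_fpga   # ~0.28
fraction_free = 1 - fraction_in_use      # ~0.72
print(f"{fraction_free:.0%} of FPGAs freed")  # ~72%, matching the ~73% above
```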
[Chart: 99th- and 95th-percentile latency comparison, bed-wide vs. remote-local]
Hardware as a Service architecture
Scaling HuS FPGAs To Ultra-Large DNNs
• Thanks to Eric Chung and team
• Distribute NN models across as many FPGAs as needed (up to thousands)
• Recent ImageNet competition: 152-layer model
• Use HaaS and LTL to manage multi-FPGA execution
• Only vectors travel over the network
• Low FPGA-FPGA latency at ~1.8µs per L0 hop
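Since only activation vectors cross the network, the cost of splitting a model is roughly the per-hop latency times the number of FPGA-to-FPGA crossings. A rough estimate under the simplifying assumption that every crossing is an L0 hop at the quoted ~1.8µs:

```python
# Added network latency for a model pipelined across FPGAs, assuming
# every stage boundary is one L0 hop at the quoted ~1.8 us.
L0_HOP_US = 1.8

def added_latency_us(num_fpgas: int) -> float:
    # A pipeline of N stages crosses the network N - 1 times.
    return (num_fpgas - 1) * L0_HOP_US

print(added_latency_us(8))   # ~12.6 us for an 8-FPGA pipeline,
                             # small next to millisecond serving budgets
```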
Huge infrastructure: Scale is the enabler
[Map: datacenters including Quincy, Cheyenne, Des Moines, Chicago, San Antonio, Boydton, Brazil, Dublin, Amsterdam, Finland, Hong Kong, Shanghai, Singapore, Japan, and Australia]
Microsoft's cloud business growing in the triple digits annually
• FPGAs deployed across 15 countries and 5 continents
• Catapult included in every new server for all major services
• In large-scale production in both Bing and Azure (and others)
• Hardware Microservices (+ DNNs) in a subset of the fleet and scaling
Looking forward …
[Block diagram: the Catapult shell, repeated from the earlier slide]
Gen2 Shell Abstractions
[Diagram: two user-logic Roles, each behind its own Elastic Router and Lightweight Transport Layer, sharing the network switch, with SDN control]
Democratizing Hardware Microservices
[Diagram: third-party ASICs and FPGAs attaching at the network and at the host]
Accelerated Networking
• Generic Flow Table (GFT): rule-based packet rewriting (sketched below)
• 10x latency reduction vs. software; CPU load now <1 core
• 25Gb/s throughput at <25µs latency: the fastest cloud network
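In software terms, GFT behaves like an exact-match flow table whose hit returns an ordered list of header-rewrite actions. A minimal sketch using the example flow from the slide's diagram; the key layout and action encoding are illustrative assumptions, not the production format:

```python
from typing import Callable, Dict, List, Tuple

# Illustrative five-tuple key and action encoding; the production GFT
# key and action formats are not specified in the talk.
FlowKey = Tuple[str, str, int, int, str]  # src_ip, dst_ip, src_port, dst_port, proto
Action = Callable[[dict], dict]

gft_table: Dict[FlowKey, List[Action]] = {}

def dnat(new_dst_ip: str, new_dst_port: int) -> Action:
    """Destination NAT: rewrite the packet's destination address/port."""
    def act(pkt: dict) -> dict:
        pkt["dst_ip"], pkt["dst_port"] = new_dst_ip, new_dst_port
        return pkt
    return act

# The slide's example rewrite: 1.2.3.1 -> 1.3.4.1, 62362 -> 80.
# (Decap and Meter would be further actions in the same list; the
# source address/port in the key are made up for the example.)
gft_table[("10.0.0.9", "1.2.3.1", 40001, 62362, "tcp")] = [dnat("1.3.4.1", 80)]

def process(pkt: dict) -> dict:
    key = (pkt["src_ip"], pkt["dst_ip"], pkt["src_port"], pkt["dst_port"], pkt["proto"])
    for act in gft_table.get(key, []):  # a miss goes to the software slow path
        pkt = act(pkt)
    return pkt
```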
• On Haswell, AES GCM-128 costs 1.26 cycles/byte [1] (5+ cores at 2.4GHz to sustain 40Gb/s)
• CBC and other algorithms are more expensive
• AES CBC-128-SHA1 takes 11µs on the FPGA vs. 4µs on the CPU (1500B packet)
• Higher latency, but significant CPU savings
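The "5+ cores" figure checks out arithmetically if it counts both directions of a full-duplex 40Gb/s link (a sketch of the math):

```python
# AES GCM-128 at 1.26 cycles/byte [1]: CPU cores needed for 40 Gb/s.
cycles_per_byte = 1.26
bytes_per_sec = 40e9 / 8                # 40 Gb/s = 5 GB/s one way
core_hz = 2.4e9                         # one 2.4 GHz core
one_way = cycles_per_byte * bytes_per_sec / core_hz   # ~2.6 cores
full_duplex = 2 * one_way                             # ~5.3 cores ("5+")
print(f"{one_way:.1f} cores one way, {full_duplex:.1f} full duplex")
```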
[Diagram: the FPGA sits between the NIC and the 40G network, holding the GFT table and crypto; an example flow's action list reads Decap, DNAT, Rewrite, Meter (1.2.3.1 -> 1.3.4.1, 62362 -> 80); the NIC serves the host's VMs]
[1] S. Gulley et al., "Haswell Cryptographic Performance"
[Diagram: training vs. inference across client and cloud, populated by humans, ASICs, GPUs, and FPGAs]