A New Era of Hardware Microservices in the Cloud
Doug Burger
Distinguished Engineer, Microsoft
UW Cloud Workshop
March 31, 2017
Moore’s Law
• Dennard Scaling has been dead for a decade
• Moore's Law is over
  • Remember, it's a rate
• Silicon scaling continues
  • But lots of weirdnesses
Specialization (?)
Hardware Microservices
[Diagram: a cloud (or client) service calling hardware microservice (HuS) instances (HuS A instances 1 and 2, HuS B instances 1 and 2, HuS C instance 1), each exposed as an IP + API at microsecond latencies]
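As a rough illustration of the model (a sketch only: the address, port, and message format below are hypothetical, not Microsoft's actual interface), a client would reach a HuS instance with a single request/response over the network:

```python
import socket

# Hypothetical address for "HuS A instance 1"; real instances would be
# discovered through the service's own directory, not hard-coded.
HUS_A_INSTANCE_1 = ("10.0.0.17", 9000)

def call_hus(request: bytes, timeout_s: float = 0.001) -> bytes:
    """One UDP round trip to a hardware microservice instance."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_s)
        sock.sendto(request, HUS_A_INSTANCE_1)
        reply, _addr = sock.recvfrom(65535)
        return reply
```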
Catapult V2 Architecture
[Diagram: WCS 2.0 server blade with Catapult V2: two CPUs with DRAM linked by QPI, the FPGA with its own DRAM attached over PCIe Gen3 (2x8 and x8 links), and QSFP ports placing the FPGA between the NIC and the switch with two 40Gb/s links]
WCS Gen4.1 Blade with Mellanox NIC and Catapult FPGA
[Photos: Pikes Peak; WCS tray backplane; option card mezzanine connectors; the Catapult v2 mezzanine card]
• The architecture justifies the economics
1. Can act as a local compute accelerator
2. Can act as a network/storage accelerator
3. Can act as a remote distributed computing fabric
Configurable Cloud
[Diagram: a CPU compute layer and a reconfigurable compute layer joined by a converged network]
[Block diagram: the HuS shell around the user-logic Role: slot DMA engine, two x8 PCIe cores (HIP 0, HIP 1) to the host CPU, DDR3 controller driving 4 GB of 72-bit DDR3, config flash (4x 256Mb QSPI, RSU), JTAG, I2C, temperature and SEU monitoring, clocking, and two 40G MACs (to the server NIC and the top-of-rack switch) with FIFOs, 40G bypass control, and ASLs]
[Diagram: the user-logic Role connects through an Elastic Router and the Lightweight Transport Layer (LTL) to the network switch]
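As a concrete (and purely illustrative) reading of that layering, an LTL-style message would carry just enough header for connections, reliable delivery, and flow control; the field names and widths here are assumptions, since the talk does not give the wire format:

```python
import struct

# Hypothetical LTL-style header: connection id, sequence number (for
# reliable delivery), and a credit field (for flow control). The actual
# LTL wire format is not given in the talk; these fields are illustrative.
LTL_HEADER = struct.Struct("!HIH")  # conn_id:16, seq:32, credits:16

def pack_ltl(conn_id: int, seq: int, credits: int, payload: bytes) -> bytes:
    return LTL_HEADER.pack(conn_id, seq, credits) + payload

def unpack_ltl(frame: bytes):
    conn_id, seq, credits = LTL_HEADER.unpack_from(frame)
    return conn_id, seq, credits, frame[LTL_HEADER.size:]
```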
LTL Network reach and latencies
[Plot: round-trip latency (µs, 0-25) vs. number of reachable hosts/FPGAs (log scale, 1 to 1,000,000), showing LTL average and 99.9th-percentile latency for L0 (same TOR), L1, and L2, with reach marks at 10K, 100K, and 250K; insets show an example L0 latency histogram, an example L1 latency histogram, and example L2 latency histograms for different pairs of FPGAs]
Scaling the HuS fabric
[Diagram: the HuS fabric spanning multiple TORs through L1 and L2 switch tiers, hosting SQL, deep neural networks, web search ranking, and GFT offload]
Use case: shared service
• Most accelerators have more throughput than a single host requires
• Share excess capacity and use fewer instances
• Frees up FPGAs for other services
• Sustains 3.6x clients / FPGA
• 73% of FPGAs remain available (a quick check below)
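A back-of-the-envelope check of those two numbers (assuming client load spreads evenly across the shared FPGAs):

```python
# If each FPGA sustains 3.6 hosts' worth of client load, only 1/3.6 of
# the fleet's FPGAs are needed to serve everyone.
clients_per_fpga = 3.6
fraction_in_use = 1 / clients_per_fpga   # ~0.28
fraction_free = 1 - fraction_in_use      # ~0.72
print(f"{fraction_free:.0%} of FPGAs freed")  # ~72%, matching the ~73% above
```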
[Chart: 99th- and 95th-percentile latency comparison, bed-wide vs. remote-local]
Hardware as a Service architecture
Scaling HuS FPGAs To Ultra-Large DNNs
• Thanks to Eric Chung and team
• Distribute NN models across as many FPGAs as needed (up to thousands)
• Recent ImageNet competition: 152-layer model
• Use HaaS and LTL to manage multi-FPGA execution
• Only vectors travel over the network
• Low FPGA-FPGA latency at ~1.8µs per L0 hop
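Since only activation vectors cross the network, the cost of splitting a model is roughly the per-hop latency times the number of FPGA-to-FPGA crossings. A rough estimate under the simplifying assumption that every crossing is an L0 hop at the quoted ~1.8µs:

```python
# Added network latency for a model pipelined across FPGAs, assuming
# every stage boundary is one L0 hop at the quoted ~1.8 us.
L0_HOP_US = 1.8

def added_latency_us(num_fpgas: int) -> float:
    # A pipeline of N stages crosses the network N - 1 times.
    return (num_fpgas - 1) * L0_HOP_US

print(added_latency_us(8))   # ~12.6 us for an 8-FPGA pipeline,
                             # small next to millisecond serving budgets
```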
Huge infrastructure: Scale is the enabler
[Map: datacenters including Quincy, Cheyenne, Des Moines, Chicago, San Antonio, Boydton, Brazil, Dublin, Amsterdam, Finland, Hong Kong, Shanghai, Singapore, Japan, and Australia]
Microsoft's cloud business growing in the triple digits annually
• FPGAs deployed across 15 countries and 5 continents
• Catapult included in every new server for all major services
• In large-scale production in both Bing and Azure (and others)
• Hardware Microservices (+ DNNs) in a subset of the fleet and scaling
Looking forward …
[Block diagram: the Catapult shell, repeated from the earlier slide]
Gen2 Shell Abstractions
[Diagram: two user-logic Roles, each behind its own Elastic Router and Lightweight Transport Layer, sharing the network switch, with SDN control]
Democratizing Hardware Microservices
[Diagram: third-party ASICs and FPGAs attaching at the network and at the host]
Accelerated Networking
• Generic Flow Table (GFT): rule-based packet rewriting (sketched below)
• 10x latency reduction vs. software; CPU load now <1 core
• 25Gb/s throughput at <25µs latency: the fastest cloud network
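In software terms, GFT behaves like an exact-match flow table whose hit returns an ordered list of header-rewrite actions. A minimal sketch using the example flow from the slide's diagram; the key layout and action encoding are illustrative assumptions, not the production format:

```python
from typing import Callable, Dict, List, Tuple

# Illustrative five-tuple key and action encoding; the production GFT
# key and action formats are not specified in the talk.
FlowKey = Tuple[str, str, int, int, str]  # src_ip, dst_ip, src_port, dst_port, proto
Action = Callable[[dict], dict]

gft_table: Dict[FlowKey, List[Action]] = {}

def dnat(new_dst_ip: str, new_dst_port: int) -> Action:
    """Destination NAT: rewrite the packet's destination address/port."""
    def act(pkt: dict) -> dict:
        pkt["dst_ip"], pkt["dst_port"] = new_dst_ip, new_dst_port
        return pkt
    return act

# The slide's example rewrite: 1.2.3.1 -> 1.3.4.1, 62362 -> 80.
# (Decap and Meter would be further actions in the same list; the
# source address/port in the key are made up for the example.)
gft_table[("10.0.0.9", "1.2.3.1", 40001, 62362, "tcp")] = [dnat("1.3.4.1", 80)]

def process(pkt: dict) -> dict:
    key = (pkt["src_ip"], pkt["dst_ip"], pkt["src_port"], pkt["dst_port"], pkt["proto"])
    for act in gft_table.get(key, []):  # a miss goes to the software slow path
        pkt = act(pkt)
    return pkt
```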
• On Haswell, AES GCM-128 costs 1.26 cycles/byte [1] (5+ cores at 2.4GHz to sustain 40Gb/s)
• CBC and other algorithms are more expensive
• AES CBC-128-SHA1 takes 11µs on the FPGA vs. 4µs on the CPU (1500B packet)
• Higher latency, but significant CPU savings
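The "5+ cores" figure checks out arithmetically if it counts both directions of a full-duplex 40Gb/s link (a sketch of the math):

```python
# AES GCM-128 at 1.26 cycles/byte [1]: CPU cores needed for 40 Gb/s.
cycles_per_byte = 1.26
bytes_per_sec = 40e9 / 8                # 40 Gb/s = 5 GB/s one way
core_hz = 2.4e9                         # one 2.4 GHz core
one_way = cycles_per_byte * bytes_per_sec / core_hz   # ~2.6 cores
full_duplex = 2 * one_way                             # ~5.3 cores ("5+")
print(f"{one_way:.1f} cores one way, {full_duplex:.1f} full duplex")
```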
[Diagram: the FPGA sits between the NIC and the 40G network, holding the GFT table and crypto; an example flow's action list reads Decap, DNAT, Rewrite, Meter (1.2.3.1 -> 1.3.4.1, 62362 -> 80); the NIC serves the host's VMs]
[1] S. Gulley et al., "Haswell Cryptographic Performance"
[Diagram: training vs. inference across client and cloud, populated by humans, ASICs, GPUs, and FPGAs]