Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD...

48
Andrew Putnam – Azure Networking / Microsoft Research Kalin Ovtcharov – AI & Architectures September 11, 2019 Heterogeneous Computing @ Microsoft

Transcript of Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD...

Page 1: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

Andrew Putnam – Azure Networking / Microsoft ResearchKalin Ovtcharov – AI & ArchitecturesSeptember 11, 2019

Heterogeneous Computing @ Microsoft

Page 2: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

© Microsoft Corporation

• Interactive Cloud apps rely

on single threaded

performance

• CPUs aren’t getting much

faster. We just get a few

more

• 2x users requires ~2x the

number of servers

Technology Scaling

http://preshing.com/20120208/a-look-back-at-single-threaded-cpu-performance/

~50%+ / year

~20% / year

~3%

Page 3: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

2013 2014 2015 2016

Datacenter Scaling

2017

Page 4: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

2016 2018

Page 5: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

2012 2018 Ratio

CPU Cores 16 36 2.25x

Storage 4 TB HDD 7 TB SDD (M.2)

(120TB HDD*)

1.75x

30x

Network 1Gb 50Gb 50x

Cloud Server Changes

Page 6: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

2012 2018 Ratio

CPU Cores 16 36 2.25x

Storage 4 TB HDD 7 TB SDD (M.2)

(120TB HDD*)

1.75x

30x

Network 1Gb 50Gb 50x

Cloud Server Changes

More Data, Coming in Faster… and CPUs aren’t keeping up

Increasing Server Lifetimes

Page 7: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

© Microsoft Corporation

Performance & Efficiency via Specialization

ASICsOther Accelerators (GPU, FPGA, NPU, DSP)

Source: Bob Broderson, Berkeley Wireless group

Page 8: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

Microsoft’s Heterogeneous Cloud

EFFICIENCYFLEXIBILITY

FPGAs

Contro

l Unit

(CU)

Registers

Arithmeti

c Logic

Unit

(ALU)

CPUs GPUsASICsNPUs

8

Page 9: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

Microsoft’s Heterogeneous Cloud

EFFICIENCYFLEXIBILITY

FPGAs

Contro

l Unit

(CU)

Registers

Arithmeti

c Logic

Unit

(ALU)

CPUs GPUsASICsNPUs

9

Page 10: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

Microsoft’s Heterogeneous Cloud

EFFICIENCYFLEXIBILITY

FPGAs

Contro

l Unit

(CU)

Registers

Arithmeti

c Logic

Unit

(ALU)

CPUs GPUsASICsNPUs

10

Page 11: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

Microsoft’s Heterogeneous Cloud

EFFICIENCYFLEXIBILITY

FPGAs

Contro

l Unit

(CU)

Registers

Arithmeti

c Logic

Unit

(ALU)

CPUs GPUsASICsNPUs

11

Page 12: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

Microsoft’s Heterogeneous Cloud

EFFICIENCYFLEXIBILITY

FPGAs

Contro

l Unit

(CU)

Registers

Arithmeti

c Logic

Unit

(ALU)

CPUs GPUsASICsNPUs

12

Page 13: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

© Microsoft Corporation

What are FPGAs?

Field Programmable Gate Array

FPGAs are a sea of generic logic and interconnect

“Silicon Legos” – build them into exactly the right circuit for each task

Special-purpose hardware (FPGAs) is faster and more efficient than general-purpose hardware (CPUs)

Low power compared to CPU/GPU

Change the hardware anytime!100 ms to 1 second reconfiguration time

Page 14: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

© Microsoft Corporation

FPGA Physical Layout

DRAM

PCIe 0

PCIe 1

40G Network

Page 15: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

© Microsoft Corporation

Software FPGA ASIC

Gradual Migration to ASIC

Page 16: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

© Microsoft Corporation

• Greater Performance and Efficiency than CPU, more general purpose than ASIC

• Many applications aren’t about throughput or double-precision floating point

• AI/ML, Bioinformatics, text processing, financial services…

• Exploits different forms of parallelism than other accelerators

Why is the FPGA a good choice as an accelerator?

Page 17: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

FE: Feature Extraction

Query: “FPGA Configuration”

NumberOfOccurrences_0 = 7 NumberOfOccurrences_1 = 4 NumberOfTuples_0_1 = 1{Query, Document}

L2 Score

Document

Score

Page 18: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

Feature Extraction Accelerator

Page 19: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

FPGAs in Physics Applications

Page 20: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

FPGAs in Cosmology

EOR Science can be done with a paperclip

and a supercomputer

-- Don C. Backer

Cosmologists often refer to their telescopes as

“software telescopes”

Page 21: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

© Microsoft Corporation

Processing Pipeline

Page 22: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

© Microsoft Corporation

Catapult: Long, Fruitful FPGA Investment

2011 2012 2013 2014 2015 2016 2017 2018

v0: Research POC

Built v0 board w/6 Xilinx FPGAs

30k lines of Bing code on FPGA

Catapult v1: Mt Granite

Distributed solution

Integrated with WCS (OCP) 1.0

v1: Scale Pilot

1632 servers deployed

Bing IndexServe accelerated

v2: Pikes Peak

Integrated Bing + Azure design

Bump-in-the-wire introduced

v2 Production and ramp

FPGAs reach production

Deployed in all new servers

Catapult v3: Longs Peak

DNN Platform for Bing

50Gb w/ integrated NIC

Azure AccelNet Unveiled

Azure production launch

AI Supercomputer demo

Project BrainWave / Storm Peak

Real-Time AI

First 3rd Party FPGA Service

Pre-History:

May 2009: Bing Launched

Feb 2010: Azure Launched

Dec 2010: Catapult concept

… …2019

Azure Databox Edge

On-Site Inference

Page 23: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

© Microsoft Corporation

Catapult v2 – Bump in the Wire

40 Gb NIC

Catapult v2 FPGA Card

CPU

PC

Ie G

en

3 x

8

PC

Ie G

en

3 x

8

PC

Ie G

en

3 x

8

ToR

40G

NIC and TOR

FPGA

4GB DDR

Page 24: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

Catapult v3 – Converged Boards

Page 25: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

© Microsoft Corporation

Accelerator Integration

FPGAsCPUs FPGAs CPUs

FPGAsCPUsFPGAs CPUs

Page 26: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

© Microsoft Corporation

Basic Catapult Architecture

• Bump-in-the-wire architecture

• One FPGA in every server Microsoft has deployed since 2015

0.5m QSFP cable from NIC to FPGA

~3m QSFP cable from FPGA to TOR

FPGA

NIC

Page 27: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

© Microsoft Corporation

New Generation Server Chassis

Page 28: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

© Microsoft Corporation

Bump-in-the-wire Architectue

CPU

NIC40G Ethernet

PCIe Gen 3 PCIe Gen 3

Compute AccelerationNetwork AccelerationHardware as a Service

AccelNet

BrainWaveML

Page 29: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

© Microsoft Corporation

Global-Scale FPGA

7 FPGAs Involved

2 FPGAs Involved

Page 30: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

© Microsoft Corporation

Configurable CloudCPU compute layer

Reconfigurable

compute layer

Converged network

Page 31: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

© Microsoft Corporation

0

5

10

15

20

25

1 10 100 1000 10000 100000 1000000

Ro

un

d-T

rip

Late

ncy (

us)

LTL L0 (same TOR)

LTL L1

Example L0 latency histogram

Example L1 latency histogram

Examples of L2 latency histograms for different pairs of FPGAs

Number of Reachable Hosts/FPGAs

Catapult Gen1 Torus(can reach up to 48 FPGAs)

LTL Average Latency

LTL 99.9th Percentile

6x8 Torus Latency

LTL L2

10K 100K 250K

Network Latencies

Page 32: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

HPC with the Cloud?

• The idea sounds great

• Pay for compute only when you use it

• When it breaks, it’s someone

else’s problem

• No need to call the realtor /

utility company when you

want a bigger machine

• New hardware just shows up.

No retrofits needed.

Page 33: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

Why hasn’t Supercomputing moved to the Cloud?

❑ CPUs look largely the same

❑ Networks are highly specialized, tuned for low-latency, high bandwidth

❑ Proved this works FPGA-to-FPGA, but what about software (CPU-to-CPU)?

❑ Supercomputers include specialized accelerators (especially GPUs)

❑ Won’t running virtual machines kill performance?

Page 34: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

VM / Container(10.1.1.1)

Network Acceleration – Azure Accelerated Networking

VM / Container(10.1.1.2)

Page 35: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

VM / Container(10.1.1.1)

Network Acceleration – Azure Accelerated Networking

VM / Container(10.1.1.2)

NIC

OS

NIC

OS

Page 36: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

Hypervisor

NIC

SDN Stack

Guest OS

VM(10.1.1.1)

Virtualization Overhead – Standard Virtual MachinesSDN Stack FunctionsSLB: Software Load BalancersACL: Access Control ListsNAT: Network Address TranslationVNET: Virtual NetworksMetering

Page 37: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

SR-IOV NIC

Guest OS

VM

Guest OS

VM

Guest OS

VM

Guest OS

VM

Guest OS

VM

Hypervisor

SDN

Virtualization Overhead – SRIOV NICs

Page 38: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

SR-IOV NIC

Guest OS

VM

Guest OS

VM

Guest OS

VM

Guest OS

VM

Guest OS

VM

Hypervisor

SDN

SDN in FPGA

Virtualization Overhead – Azure SmartNIC w/ FPGAs

GFT

Page 39: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

SR-IOV NIC

VM VM VM VM VM

Hypervisor

SDN

SDN in FPGA

Virtualization Overhead – Azure SmartNIC w/ FPGAs & DPDK

GFT

DPDK DPDK DPDK DPDK DPDK

Page 40: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

AccelNet Performance

Page 41: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

© Microsoft Corporation

• Use multiple FPGAs to perform a complex task

• Head node is the gateway

• Use as many nodes as necessary – calling function doesn’t need to know

Distributed Hardware-as-a-Service

F F F

L0

L1

F F F

L0

BrainWave Compiler & Mapper Scalable DNN Hardware

Microservice

Network switches

FPGAs

Pretrained DNN Model

(CNTK, TensorFlow, etc…)

Page 42: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

© Microsoft Corporation

Project BrainWave Concept

Accel.

CPU

Ethernet Switch

Traditional Approach Project Brainwave

FPGA

CPU

FPGA

CPU

FPGA

CPU

Page 43: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

Brainwave System at Cloud-Scale

Page 44: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

© Microsoft Corporation

• 8 billion operations per image

• 1.5 ms on Resnet-50 with batch size of 1 (Realtime AI)

• Available for anyone to upload their models and run

Project Brainwave – Fastest Image Classification

https://github.com/Azure/aml-real-time-ai

Page 45: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

Bing Intelligent Search Backed By Project BrainwaveStratix V RNN-optimized NPUs

in Scale Production

45

Page 46: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

© Microsoft Corporation

Attend the Turotial tomorrow for more detail on BrainWave

Page 47: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

© Microsoft Corporation

• Performance? Low Power? Ubiquity? Killer features? AI/ML?

Why will FPGAs still be here tomorrow?

Azure (Nov 2018) Amazon AWS (Nov 2018)

• Every time we change hardware, we need to make a new VM type

• FPGAs allow us to back new features into existing VM types

B

Dv3

Dsv3 Dv2Av2

DC

M

Fsv2 FGs

Ev3

GLs NCv2

NDBV

NVv2

HT2

M5

C5 R4

X1eP3

H1

I3

D2A1

F1

Page 48: Heterogeneous Computing @ Microsoft€¦ · 2012 2018 Ratio CPU Cores 16 36 2.25x Storage 4 TB HDD 7 TB SDD (M.2) (120TB HDD*) 1.75x 30x Network 1Gb 50Gb 50x Cloud Server Changes

© Copyright Microsoft Corporation. All rights reserved.