The NExT Cloud Fabric - University of South Carolina · Target: Accelerate Ranking as a Service...
Catapult: A Reconfigurable Fabric for Petaflop Computing in the Cloud
Doug Burger
Director, Hardware, Devices, & Experiences
MSR NExT
November 15, 2015
The Cloud is a Growing Disruptor for HPC
Disruption
Homogeneity
Moore’s Law
Economics
A 2-3 Horse Race
Hyperscale Cloud Fabrics
[Diagram: network fabric of ToR and CS switches]
Accelerator Constraints of the Cloud
Efficiency (ASICs)
Homogeneity
Catapult Project History
• December 9, 2010 – initial meeting
• Christmas break 2010: feasible to accelerate ranking?
• January 12, 2011 – Meeting with Bing leadership
• 2011 – v0: ported the then-current Bing ranking stack, built BFB board
• 2012 – v1: developed distributed architecture
• 2013 – Took v1 to scale, Bing pilot
• 2014 – v2: developed new architecture, commenced work with Azure
• 2015 – Mainstreamed: production and expansion
• Intel announced Altera acquisition, $16.7B
Microsoft Open Compute Server
• Two 8-core Xeon 2.1 GHz CPUs
• 64 GB DRAM
• 4 HDDs, 2 SSDs
• 10 Gb Ethernet
• No cable attachments to server
Microsoft Confidential
Catapult V1 Accelerator Card
• Altera Stratix V D5
• 172.6K ALMs, 2014 M20Ks
• 457K LEs
• 1 KLE == ~12K gates
• M20K is a 2.5KB SRAM
• PCIe Gen 2 x8, 8GB DDR3
• 20 Gb network among FPGAs
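The figures above imply rough totals for on-chip SRAM and equivalent gate count. The short sketch below only multiplies out the slide's own numbers; the constant and function names are illustrative, and the 12K-gates-per-KLE rule is the slide's approximation, not a vendor figure.

```python
# Back-of-the-envelope totals from the Stratix V D5 figures on the slide.
M20K_COUNT = 2014               # number of M20K blocks
M20K_BYTES = 20 * 1024 // 8     # one M20K holds 20 Kbit = 2,560 bytes (2.5 KB)
KLE_COUNT = 457                 # logic elements, in thousands
GATES_PER_KLE = 12_000          # slide's rule of thumb: 1 KLE ~ 12K gates


def on_chip_sram_bytes() -> int:
    """Total M20K capacity in bytes."""
    return M20K_COUNT * M20K_BYTES


def approx_gate_count() -> int:
    """Approximate equivalent gate count using the slide's rule of thumb."""
    return KLE_COUNT * GATES_PER_KLE


if __name__ == "__main__":
    print(f"On-chip SRAM: {on_chip_sram_bytes() / 2**20:.2f} MiB")
    print(f"Approx. gates: {approx_gate_count() / 1e6:.2f} million")
```

In other words, the M20K blocks add up to a little under 5 MiB of on-chip SRAM, and the logic fabric corresponds to roughly 5.5 million gates under the stated rule of thumb.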
[Board photo labels: Stratix V; 8GB DDR3; PCIe Gen3 x8]
6x8 Torus in a 2x24 Server Layout
1,632 server pilot deployed in production BN datacenter
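As a rough illustration of the topology, the sketch below computes the four wraparound neighbors of a node in a 6x8 torus and checks how many 48-node tori fit in the 1,632-server pilot. The (row, col) coordinate scheme and the function names are assumptions made for illustration; the slide does not say how torus positions map to physical server slots.

```python
ROWS, COLS = 6, 8        # torus dimensions from the slide
PILOT_SERVERS = 1632     # pilot size from the slide


def torus_neighbors(row: int, col: int, rows: int = ROWS, cols: int = COLS):
    """Return the (row, col) neighbors in the order north, south, west, east.

    Modular arithmetic provides the wraparound links that distinguish a
    torus from a plain mesh.
    """
    return [
        ((row - 1) % rows, col),
        ((row + 1) % rows, col),
        (row, (col - 1) % cols),
        (row, (col + 1) % cols),
    ]


def torus_count(servers: int = PILOT_SERVERS, rows: int = ROWS, cols: int = COLS) -> int:
    """Number of complete tori, assuming one FPGA per server."""
    nodes = rows * cols
    if servers % nodes:
        raise ValueError("server count is not a multiple of the torus size")
    return servers // nodes


if __name__ == "__main__":
    print(torus_neighbors(0, 0))   # corner node wraps to the opposite edges
    print(torus_count())
```

With these numbers the pilot divides evenly into 34 groups of 48 servers.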
Target: Accelerate Ranking as a Service
[Diagram: three groups of nodes labeled IFM 1 through IFM 44, and one group labeled SaaS 1 through SaaS 48]
Ranking-as-a-Service (RaaS)
- Compute relevance scores for each selected doc
- Sort the scores and return the results
Selection-as-a-Service (SaaS)
- Find all docs that contain query terms
- Filter and select candidate documents for ranking
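The two services can be pictured as a select-then-rank pipeline. The sketch below is a toy model only: the function names, the rule that a document must contain every query term, and the term-count scoring function are all assumptions for illustration, not Bing's actual selection or ranking logic.

```python
from typing import Callable, Dict, List


def select(query_terms: List[str], docs: Dict[str, str]) -> List[str]:
    """Selection step: keep the ids of docs that contain every query term."""
    terms = {t.lower() for t in query_terms}
    return [
        doc_id
        for doc_id, text in docs.items()
        if terms <= set(text.lower().split())
    ]


def term_count_score(query_terms: List[str], text: str) -> float:
    """Hypothetical relevance score: total occurrences of the query terms."""
    words = text.lower().split()
    return float(sum(words.count(t.lower()) for t in query_terms))


def rank(
    query_terms: List[str],
    docs: Dict[str, str],
    candidates: List[str],
    score: Callable[[List[str], str], float] = term_count_score,
) -> List[str]:
    """Ranking step: score each selected doc, then sort by descending score."""
    scored = [(score(query_terms, docs[d]), d) for d in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored]


if __name__ == "__main__":
    corpus = {
        "a": "fpga cloud fpga",
        "b": "cloud gpu",
        "c": "fpga in the cloud",
    }
    query = ["fpga", "cloud"]
    print(rank(query, corpus, select(query, corpus)))
```

Keeping selection and ranking as separate functions mirrors the split into two services: the scoring function is the expensive part, so it is the piece that would be swapped out for an accelerated implementation.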
[Diagram: Selection as a Service (SaaS), drawn as three groups of nodes labeled IFM 1 through IFM 44, and Ranking as a Service (RaaS), drawn as nodes labeled RaaS 1 through RaaS 48. Flow labels: Query; Selected Documents; 10 blue links]
FPGA Accelerator for Bing Ranking
FFE: Free-Form Expressions
MLS: Machine Learning Scoring
FE: Feature Extraction
Document + Query
Score
Document features - hand-coded Verilog
FFE #1 = (2 * NumberOfOccurrences_0 + NumberOfOccurrences_1)(2 * NumberOfTuples_0_1)
~4K features
~2K synthetic features
[Diagram: decision tree with labels FE7, FFE3, FFE2, and FE9; threshold tests ≤ T1 / > T1, ≤ T2 / > T2, and ≤ T3 / > T3; and score leaves]
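The three stages (FE, FFE, MLS) can be sketched as: extracted features feed a synthetic-feature expression, and a decision tree of threshold tests turns the combined features into a score. The sketch below is a software illustration under stated assumptions: the `Node`/`Leaf` types, the thresholds, the leaf scores, and `ffe_example` (a simplified, hypothetical expression built from two of the feature names on the slide) are invented for illustration and are not the production model.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    score: float


@dataclass
class Node:
    feature: str                  # name of the feature tested at this node
    threshold: float
    low: Union["Node", Leaf]      # branch taken when value <= threshold
    high: Union["Node", Leaf]     # branch taken when value > threshold


def ffe_example(features: Dict[str, float]) -> float:
    """Hypothetical free-form expression combining two extracted features."""
    return 2 * features["NumberOfOccurrences_0"] + features["NumberOfOccurrences_1"]


def tree_score(tree: Union[Node, Leaf], features: Dict[str, float]) -> float:
    """Walk the tree, comparing one feature per node, until a leaf is reached."""
    node = tree
    while isinstance(node, Node):
        value = features[node.feature]
        node = node.low if value <= node.threshold else node.high
    return node.score


if __name__ == "__main__":
    # Feature extraction output (made-up values), then one synthetic feature.
    feats = {"NumberOfOccurrences_0": 3.0, "NumberOfOccurrences_1": 1.0}
    feats["FFE_example"] = ffe_example(feats)

    tree = Node(
        "FFE_example", 5.0,
        Leaf(0.1),
        Node("NumberOfOccurrences_1", 2.0, Leaf(0.4), Leaf(0.9)),
    )
    print(tree_score(tree, feats))
```

Each tree node needs only one feature lookup and one comparison, which is the kind of regular, fine-grained work that maps well onto dedicated hardware.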
Query Augmentation
Query Understanding
Document Selection
Document Ranking
Caption Generation
Page Assembly
12-Stage Pipeline
[Diagram: FPGA 0 through FPGA 11]
Demonstrated ~2x throughput gain and stability justifying production
Pilot Results (FPGA vs. Software)
[Plot: Average Latency vs. Throughput, HW vs. SW]
[Plot: 95% Latency vs. Throughput, HW vs. SW. Annotation: Bing’s latency target at ~2X throughput]
Catapult V1 Shell Architecture
[Block diagram. Labels: PCIe core, Gen2 x8 (Gen3 capable); PCIe DMA; inter-FPGA router; Xcvr config; SLIII core (x4); local application; DDR3 core (x2); 4GB SO-DIMM (x2); RSU; 256 Mb NAND; Driver; Reconfig; voltage regulator; 12V, 1.5V, 0.85V; status LEDs; JTAG; FPGA. Other annotations: 2 x 16 RAMs; 32 B – 64 KB / slot; 64 slots; I/O]
Production issues at scale
• Build system
• License servers, availability of source, build machines
• Scale-out qualification of IP
• Clean interfaces for high-productivity development environment
• Shell/driver/application versioning and deployment
• Backwards compatibility
• Health monitoring and failure diagnostics
• Continuous reporting of interfaces health, soft error rate, etc.
• Debugging (esp. on livesite)
• Flight Data Recorder to replay bug-generating condition
• System integrity testing - many servers/vendors
• Scalability of verification
• In situ updates to drivers, golden image, shell
• Supply chain management
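The Flight Data Recorder item in the list above can be pictured as a fixed-size circular buffer that always holds the most recent events, so that the window leading up to a failure can be dumped and replayed. The sketch below is a software analogy under that assumption; the class name, event format, and capacity are hypothetical and say nothing about how the hardware recorder is actually built.

```python
from collections import deque
from typing import Deque, List, Tuple

Event = Tuple[int, str]  # (cycle or timestamp, description); format is illustrative


class FlightDataRecorder:
    """Keep only the most recent `capacity` events, discarding the oldest."""

    def __init__(self, capacity: int = 4) -> None:
        self._events: Deque[Event] = deque(maxlen=capacity)

    def record(self, event: Event) -> None:
        # Once the buffer is full, appending silently drops the oldest entry.
        self._events.append(event)

    def dump(self) -> List[Event]:
        """Return the retained events, oldest first, for offline replay."""
        return list(self._events)


if __name__ == "__main__":
    fdr = FlightDataRecorder(capacity=3)
    for cycle in range(5):
        fdr.record((cycle, f"packet {cycle} received"))
    print(fdr.dump())
```

Because the buffer never grows, recording can stay enabled continuously, which is what makes this approach practical on a live site.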
Azure SmartNIC
• Announced at ONS
• Use an FPGA for reconfigurable functions
• FPGAs are already used in Bing (Catapult)
• Roll out hardware as we do software
• Programmed using Generic Flow Tables (GFT)
• Language for programming SDN to hardware
• Uses connections and structured actions as primitives
• SmartNIC can also do Crypto, QoS, storage acceleration, and more …
• 40Gb bidirectional AES demo
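To make "connections and structured actions as primitives" concrete, the sketch below models a flow table keyed by a connection 5-tuple, where each entry holds an ordered list of actions applied to matching packets. This is not the GFT language or API: `FlowKey`, `FlowTable`, `rewrite_dst_ip`, `set_qos`, and the dictionary packet format are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List, NamedTuple, Optional

Packet = Dict[str, object]
Action = Callable[[Packet], Packet]


class FlowKey(NamedTuple):
    """A connection, identified here by the usual 5-tuple (an assumption)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str


def rewrite_dst_ip(new_ip: str) -> Action:
    """Build an action that returns a copy of the packet with a new destination."""
    def action(packet: Packet) -> Packet:
        return {**packet, "dst_ip": new_ip}
    return action


def set_qos(level: int) -> Action:
    """Build an action that tags the packet with a QoS level."""
    def action(packet: Packet) -> Packet:
        return {**packet, "qos": level}
    return action


class FlowTable:
    def __init__(self) -> None:
        self._flows: Dict[FlowKey, List[Action]] = {}

    def install(self, key: FlowKey, actions: List[Action]) -> None:
        self._flows[key] = list(actions)

    def process(self, packet: Packet) -> Optional[Packet]:
        """Apply the flow's actions in order; return None on a table miss."""
        key = FlowKey(*(packet[name] for name in FlowKey._fields))
        actions = self._flows.get(key)
        if actions is None:
            return None  # a real system would hand the packet to a slower path
        for act in actions:
            packet = act(packet)
        return packet


if __name__ == "__main__":
    table = FlowTable()
    key = FlowKey("10.0.0.1", "10.0.0.2", 1234, 80, "tcp")
    table.install(key, [rewrite_dst_ip("192.168.1.5"), set_qos(3)])
    print(table.process(dict(key._asdict(), payload="hello")))
```

The point of the structure is that per-packet work reduces to one exact-match lookup followed by a fixed action list, which is the kind of operation that can be moved from software into reconfigurable hardware.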
[Diagram labels: Host; CPU; NIC ASIC; FPGA; ToR]
FPGAs “versus” GPUs
            | CPUs            | GPUs              | FPGAs
Language    | C/C++           | CUDA              | Verilog -> OpenCL (?)
Performance | 400 Gflops      | 6 Tflops -> 10T   | 100G -> 1T -> 4T
Efficiency  | 5 Gflops/W      | -> 20 Gflops/W    | 40-50 G/W -> 80-100 G/W
Scale       | 2M+ and growing | 1s -> 10s -> 100s | 10Ks -> 100Ks -> 1M+
DRAM BW     | 85 GB/s         | 2x240 GB/s        | 10GB/s -> 20GB/s -> 200-500GB/s
Large-Scale Reconfigurable Computing for HPC
[Diagram: fabric of ToR and CS switches, with workload labels: Deep Learning; Bing Ranking HW; HPC / MPI Offload; Deep Compression; Bing Ranking SW]
Conclusions
• We are at the dawn of a new era
• Programmable logic playing a central role in systems at massive scale
• “A new kind of computer”
• Will enable new applications and services to be cost effective
• Will change system architecture, both in server and at cloud scale