Reconfigurable Computing: FPGAs for Ultrascale Science Sandia National Laboratories

Keith Underwood, SNL/NM
Craig Ulmer, SNL/CA, [email protected]
SOS-8 Workshop, April 14, 2004

Motivation: CPU Efficiency Trend

[Figure: efficiency in MFLOPS/MHz/Mtransistors for successive processors: 386 16 MHz, 486 66 MHz, P1 75 MHz, P1 166 MHz, P2 450 MHz, P3 550 MHz, P3 800 MHz, P3 1.0 GHz, P4 2.8 GHz, P4 3.2 GHz]

While CPU performance has been increasing, processing efficiency has been decreasing.
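
The metric on this slide divides the sustained floating-point rate by both the clock rate and the transistor budget. A minimal sketch of the arithmetic follows; the two processors are hypothetical and are not data points from the chart.

```python
def efficiency(mflops: float, clock_mhz: float, mtransistors: float) -> float:
    """MFLOPS per MHz per million transistors, the metric on this slide."""
    return mflops / clock_mhz / mtransistors


# Hypothetical processors (not taken from the chart): identical clock and
# FLOP rate, but the second spends twice as many transistors to reach it.
lean = efficiency(mflops=1000.0, clock_mhz=1000.0, mtransistors=25.0)
fat = efficiency(mflops=1000.0, clock_mhz=1000.0, mtransistors=50.0)
print(f"lean: {lean:.3f}, fat: {fat:.3f}")  # doubling transistors halves efficiency
```

A processor whose FLOP rate grows more slowly than clock rate times transistor count shows falling efficiency, which is the trend the chart describes.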

Looking Ahead

• For commodity clusters, should we be nervous?
  – Significant increases in technology effort
  – Diminishing returns
  – Should we depend on CPU manufacturers for HPC?

• Sandia has many HPC interests
  – Investigate computing alternatives and accelerators
  – FPGAs: modern reconfigurable computing

Outline

• Reconfigurable computing
  – Use FPGAs to accelerate computations

• Strategy and examples
  – Approaches to scientific computing

• Challenges for ultrascale science
  – Double-precision floating-point performance
  – System integration and network aspects

Reconfigurable Computing Background

“Soft Hardware”

Computing Spectrum

[Figure: a CPU instruction pipeline (fetch, decode/registers, execute, memory, writeback) contrasted with a fixed hardware datapath of adders, multipliers, an XOR, and a z⁻¹ delay that combines inputs A, B, C, D, and π into a result]

Software: General-Purpose CPU
• Easily reprogrammed
• Low cost
• Fundamental bottlenecks

Hardware: Application-Specific Integrated Circuit (ASIC)
• Not modifiable
• High cost
• Extremely fast

Soft hardware: Field-Programmable Gate Arrays (FPGAs)
• Reconfigurable hardware
• Medium cost
• Speedup potential

Reconfigurable Hardware Devices

• Tile architecture
  – Logic blocks (LBs)
  – Routing elements

• Field-Programmable Gate Arrays
  – Fine granularity
  – LBs are bit-level operators

• Commercial trend
  – Coarse granularity
  – LBs are ALUs, FPUs
  – QuickSilver, PACT XPP, ClearSpeed

[Figure: tile architecture, a grid of logic blocks (LBs) joined by routing elements]

Devices that can be programmed to emulate hardware circuitry
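
To make "LBs are bit-level operators" concrete, the sketch below models a logic block as a k-input lookup table (LUT), which is how Xilinx parts of this generation implement logic (Virtex-II slices are built from 4-input LUTs). It is a behavioral toy that assumes a plain truth-table configuration, not a model of any vendor's actual configuration format.

```python
from typing import Callable, List


class LUT:
    """Toy model of a k-input lookup table, the bit-level operator of a logic block.

    The configuration is simply a 2**k-entry truth table; evaluating the block
    is an indexed read of one stored bit.
    """

    def __init__(self, k: int, truth_table: List[int]) -> None:
        if len(truth_table) != 2 ** k:
            raise ValueError("truth table must have 2**k entries")
        self.k = k
        self.bits = [b & 1 for b in truth_table]

    @classmethod
    def configure(cls, k: int, fn: Callable[..., int]) -> "LUT":
        # "Programming" the block: evaluate fn once for every input combination.
        # Entry i holds fn applied to the bits of i, with input 0 as the LSB.
        table = [fn(*[(i >> j) & 1 for j in range(k)]) & 1 for i in range(2 ** k)]
        return cls(k, table)

    def __call__(self, *inputs: int) -> int:
        if len(inputs) != self.k:
            raise ValueError(f"expected {self.k} inputs")
        index = sum((bit & 1) << j for j, bit in enumerate(inputs))
        return self.bits[index]


# The same 4-input block, reconfigured for two unrelated functions.
xor4 = LUT.configure(4, lambda a, b, c, d: a ^ b ^ c ^ d)
majority3 = LUT.configure(4, lambda a, b, c, d: (a & b) | (a & c) | (b & c))  # d ignored
print(xor4(1, 0, 1, 1), majority3(1, 1, 0, 0))
```

Reconfiguring the device amounts to rewriting these truth tables and the routing between blocks, which is what lets the same silicon "emulate hardware circuitry."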

Common Acceleration Techniques

• Processing concurrency
• Hardware pipelines
• Custom memory interactions
• Partial evaluation

[Figure: examples of the techniques, including several external SRAM banks and internal SRAM feeding custom datapaths]

Key: designing in hardware
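
Partial evaluation specializes a circuit for values that are known when the FPGA is configured. A standard illustration, assumed here rather than taken from the slide, is multiplication by a fixed constant: the general multiplier collapses into a few shifts (free wiring in hardware) and adds. The Python below mirrors that structure in software; the function names are hypothetical.

```python
from typing import Callable, List


def shift_add_terms(constant: int) -> List[int]:
    """Bit positions set in a non-negative constant; each becomes a shifted adder input."""
    if constant < 0:
        raise ValueError("this sketch handles non-negative constants only")
    return [s for s in range(constant.bit_length()) if (constant >> s) & 1]


def specialize_multiplier(constant: int) -> Callable[[int], int]:
    """Partially evaluate x * constant for a constant known before configuration."""
    shifts = shift_add_terms(constant)

    def multiply(x: int) -> int:
        # Shifts are just wiring in hardware; each additional set bit costs one adder.
        return sum(x << s for s in shifts)

    return multiply


times_10 = specialize_multiplier(10)  # 10 = 0b1010 -> (x << 1) + (x << 3)
print(shift_add_terms(10), times_10(7))
```

Constants with few set bits specialize into very small circuits, which is where this technique saves the most area.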

Reconfigurable Computing for Ultrascale Science:

HPC Strategy and Examples

Enhancing HPC Performance

HPC Strategy at Sandia for RC

• RC resources work best as accelerators in HPC
  – Clusters are inexpensive and work well for many applications
  – Add RC devices to enhance performance

• Port key portions of algorithms to RC hardware
  – Focus on hotspots and inner loops
  – Move data to/from FPGAs in a pipelined fashion
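
The benefit of moving data in a pipelined fashion can be estimated with a simple model. It assumes the data is split into equal blocks and that host-to-FPGA transfer, FPGA computation, and FPGA-to-host transfer are independent stages that can work on different blocks at the same time. The stage times below are hypothetical.

```python
def serial_time(n_blocks: int, t_in: float, t_compute: float, t_out: float) -> float:
    """Each block is sent, processed, and returned before the next block starts."""
    return n_blocks * (t_in + t_compute + t_out)


def pipelined_time(n_blocks: int, t_in: float, t_compute: float, t_out: float) -> float:
    """Transfer-in, compute, and transfer-out overlap on different blocks.

    The first block pays the full latency; after that, one block completes
    each time the slowest stage completes.
    """
    if n_blocks <= 0:
        return 0.0
    stages = (t_in, t_compute, t_out)
    return sum(stages) + (n_blocks - 1) * max(stages)


# Hypothetical stage times in microseconds for 100 blocks.
print(serial_time(100, 2.0, 5.0, 2.0), pipelined_time(100, 2.0, 5.0, 2.0))
```

Once the pipeline fills, throughput is set by the slowest stage, so the approach pays off only when transfers keep pace with the FPGA kernel.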

Scientific Computing Examples

• Pattern recognition
  – ATLAS project at CERN
  – Reduced 2500 CPUs to 120 nodes with FPGAs

• Visualization
  – Vizard II project at the University of Tübingen
  – Direct volume rendering for 512³ datasets

• Molecular dynamics (MD)
  – Preliminary work at Los Alamos National Laboratory
  – 20 cells in an FPGA yield 5.69 GFLOPS

• Computational fluid dynamics (CFD) analysis for jet engines
  – Smith and Schnore at GE Global Research

Inner loop function | FLOPs | P4 1.8 GHz host | Multi-FPGA system
Euler               | 165   | 154 MFLOPS      | 10.2 GFLOPS
Viscous             | 619   | 77 MFLOPS       | 23.2 GFLOPS
Smoothing           | 249   | 86 MFLOPS       | 7.0 GFLOPS
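
The CFD table reports inner-loop rates, not whole-application speedup. The sketch below derives each kernel's speedup from the table and then applies Amdahl's law; the 90% share of runtime spent in the accelerated loop is a hypothetical value, not a figure from GE.

```python
# (host MFLOPS on the P4 1.8 GHz, multi-FPGA MFLOPS), from the CFD table above.
CFD_KERNELS = {
    "Euler": (154.0, 10_200.0),
    "Viscous": (77.0, 23_200.0),
    "Smoothing": (86.0, 7_000.0),
}


def kernel_speedup(host_mflops: float, fpga_mflops: float) -> float:
    return fpga_mflops / host_mflops


def amdahl(fraction_accelerated: float, speedup: float) -> float:
    """Whole-program speedup when only `fraction_accelerated` of runtime gets faster."""
    return 1.0 / ((1.0 - fraction_accelerated) + fraction_accelerated / speedup)


for name, (host, fpga) in CFD_KERNELS.items():
    s = kernel_speedup(host, fpga)
    # Hypothetical: 90% of the application's runtime is spent in this loop.
    print(f"{name}: kernel {s:.0f}x, application {amdahl(0.9, s):.1f}x")
```

Even a several-hundred-fold kernel speedup is capped below 10x overall when 10% of the runtime stays on the host, which is why the strategy focuses on hotspots and inner loops.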

Challenges

• Hard to program
  – Requires hardware design
  – The algorithm must expose significant parallelism

• Limited chip capacity

• Lack of HPC building blocks
  – Our users need double-precision floating point (DP-FP)

• System integration
  – How do we add FPGAs to our clusters?

[Figure: labels for Craig Ulmer (SNL/CA), Keith Underwood (SNL/NM), LANL and academia, and industry]

Reconfigurable Computing for Ultrascale Science:

Double-Precision Floating-Point Cores

Addressing the need for HPC building blocks

Double-Precision Floating-Point Cores

• Floating point has been a historical weakness for FPGAs
  – FP cores consume significant amounts of hardware
  – Previous FPGAs lacked capacity

• Significant improvements in recent commercial FPGAs
  – Increased capacity, faster clocks, and better building blocks

• Keith Underwood at SNL/NM
  – Re-evaluating FP performance in FPGAs
  – Constructing high-speed DP-FP cores
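
Part of why DP-FP cores consume so much hardware is visible in the IEEE 754 format itself. The sketch below unpacks a double into its sign, 11-bit exponent, and 52-bit fraction fields. It also gives a rough first-order estimate, an assumption rather than a measurement of these cores, of how much larger a simple array multiplier gets when the significand grows from 24 bits (single) to 53 bits (double), both counting the implicit leading bit.

```python
import struct
from typing import Tuple


def double_fields(x: float) -> Tuple[int, int, int]:
    """Split an IEEE 754 binary64 value into (sign, biased exponent, fraction)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF        # 11 bits, bias 1023
    fraction = bits & ((1 << 52) - 1)      # 52 stored bits; normal numbers add an implicit leading 1
    return sign, exponent, fraction


# Significand widths including the implicit bit: 24 (single) vs 53 (double).
# A simple array multiplier's partial-product count grows with width squared.
multiplier_area_ratio = (53 / 24) ** 2
print(double_fields(-2.5), round(multiplier_area_ratio, 2))
```

Real cores use embedded multiplier blocks and other optimizations, so the roughly 5x figure is only an indication of why double precision is costly on an FPGA.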

Peak Performance Results

Core           | SP speed | SP cores per V2P100-6 | SP peak    | DP speed | DP cores per V2P100-6 | DP peak
Addition       | 195 MHz  | 89                    | 17 GFLOPS  | 143 MHz  | 40                    | 5.7 GFLOPS
Multiplication | 176 MHz  | 74                    | 13 GFLOPS  | 142 MHz  | 27                    | 3.8 GFLOPS
Division       | 120 MHz  | 22                    | 2.6 GFLOPS | 98 MHz   | 6                     | 0.58 GFLOPS

(SP = single precision, DP = double precision)

From Underwood, “FPGAs vs. CPUs: Trends in Peak Floating-Point Performance,” FPGA ’04.
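
The peak figures in the table are consistent with clock rate multiplied by core count, which is what you get if each pipelined core delivers one result per cycle. That reading is an inference from the numbers, not a statement from the paper; the sketch below checks it against every row.

```python
# (operation, precision): (clock MHz, cores per V2P100-6, reported peak GFLOPS),
# copied from the table above.
CORES = {
    ("add", "SP"): (195, 89, 17.0),
    ("add", "DP"): (143, 40, 5.7),
    ("mul", "SP"): (176, 74, 13.0),
    ("mul", "DP"): (142, 27, 3.8),
    ("div", "SP"): (120, 22, 2.6),
    ("div", "DP"): (98, 6, 0.58),
}


def peak_gflops(clock_mhz: float, cores: int) -> float:
    # Assumes every core is fully pipelined and retires one operation per clock.
    return clock_mhz * cores / 1000.0


for (op, precision), (clock, n, reported) in CORES.items():
    print(f"{op} {precision}: {peak_gflops(clock, n):.2f} GFLOPS computed, {reported} reported")
```

The computed values land within a few percent of the reported ones, with the small gaps explained by rounding in the table.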

Double-Precision Multiply Performance Trends

Reconfigurable Computing for Ultrascale Science:

Networking Aspects

Addressing capacity and system integration issues

Data Exchange: Multi-Gigabit Transceivers (MGTs)

• How do we rapidly move data into and out of the FPGA?

• Xilinx Virtex-II Pro FPGAs have MGTs
  – Channel data rates: 3.125 Gbps
  – Up to 24 channels
  – Virtex-II Pro X: twenty 10 Gbps channels

• Configurable for different physical layers
  – InfiniBand, Fibre Channel (FC), GigE, 10GigE
  – S-ATA, PCI Express, HyperTransport (HT)

[Figure: the FPGA fabric connected to several RocketIO MGTs, each driving its own pins]
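
A back-of-the-envelope sketch of the aggregate bandwidth implied by the Virtex-II Pro figures on this slide (3.125 Gbps per channel, 24 channels), per direction. It assumes 8b/10b line coding, which RocketIO supports and which protocols such as InfiniBand and Fibre Channel use at these rates; a different encoding changes the payload figure.

```python
def aggregate_payload_gbps(line_rate_gbps: float, channels: int,
                           data_bits: int = 8, coded_bits: int = 10) -> float:
    """Payload bandwidth per direction across all channels.

    The defaults model 8b/10b line coding, where every 10 bits on the wire
    carry 8 bits of data.
    """
    return line_rate_gbps * channels * data_bits / coded_bits


raw = aggregate_payload_gbps(3.125, 24, data_bits=1, coded_bits=1)  # no coding overhead
payload = aggregate_payload_gbps(3.125, 24)                         # assumes 8b/10b
print(raw, payload)
```

Under these assumptions a fully populated part moves 75 Gbps of raw line rate, or 60 Gbps of payload, in each direction.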

Importance of MGTs

Increase raw capacity
• Connect FPGAs together
  – MGTs provide fat pipes
  – Cables, not PCB traces

System integration
• Connect the FPGA to the system area network (SAN)
  – Implement the network interface (NI) in the FPGA
  – The FPGA is a global resource

[Figure, left: four FPGAs, each holding computational circuits, linked by MGT channels. Right: an FPGA containing computational circuits and NI Tx/Rx blocks, attached to a system area network alongside CPU nodes with NICs]

Recent Sandia Work: SNL OpenTOE

• At Sandia we are interested in connecting FPGAs to SANs
  – Main target: InfiniBand
  – Must implement network protocols for reliable transfer

• Initial work: GigE and TCP
  – Implemented a GigE core and a basic TCP offload engine

[Figure: the SNL OpenTOE NI inside the FPGA, with computational circuits attached to a TCP core, an IP core, and a GigE core that transmits and receives (Tx/Rx) through an MGT]
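
Reliable transfer puts per-packet work in the offload engine. One small, well-defined piece that any TCP/IP offload engine must implement is the Internet checksum (RFC 1071), computed over IP headers and TCP segments. The Python below is a software reference model of that computation, of the kind used to check a hardware core; it is not taken from SNL OpenTOE, and the TCP checksum's additional pseudo-header is left to the caller.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum of RFC 1071, used by IPv4 headers and TCP."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in (end-around carry)
    return ~total & 0xFFFF


# IPv4 header with its checksum field (bytes 10-11) zeroed.
header = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
print(hex(internet_checksum(header)))
```

A receiver runs the same sum over the header with the checksum filled in and expects zero, which is why the computation maps well onto a small streaming adder in hardware.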

Concluding Remarks

• Improvements in commercial FPGAs make RC attractive
  – FPGAs provide better sustained performance than CPUs
  – FPGA performance is growing faster than Moore’s Law

• Near-term strategy: accelerator-based approach
  – Offload key operations into hardware

• Sandia National Labs is investigating RC for HPC acceleration
  – Enabling scientific computing through fast DP-FP cores
  – Addressing system integration and capacity issues via the network