Reconfigurable Computing: FPGAs for Ultrascale Science
Sandia National Laboratories
Keith Underwood SNL/NM
Craig Ulmer SNL/CA cdulmer@sandia.gov
SOS-8 Workshop, April 14, 2004
Motivation: CPU Efficiency Trend
Efficiency: MFLOPS/MHz/Mtransistors
[Figure: bar chart of efficiency (vertical axis from 0 to 0.06) by processor: 386 16 MHz, 486 66 MHz, P1 75 MHz, P1 166 MHz, P2 450 MHz, P3 550 MHz, P3 800 MHz, P3 1.0 GHz, P4 2.8 GHz, P4 3.2 GHz]
While CPU performance has been increasing, processing efficiency has been decreasing.
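The efficiency metric on this slide divides floating-point throughput by clock rate and by transistor count. A minimal sketch of that calculation follows; the function name and the sample inputs are hypothetical placeholders, not values read from the chart.

```python
def efficiency(mflops: float, mhz: float, mtransistors: float) -> float:
    """Return MFLOPS per MHz per million transistors."""
    if mhz <= 0 or mtransistors <= 0:
        raise ValueError("clock rate and transistor count must be positive")
    return mflops / mhz / mtransistors


# Hypothetical inputs for illustration only (not measured data).
sample = efficiency(mflops=100.0, mhz=200.0, mtransistors=10.0)
print(f"{sample:.3f} MFLOPS/MHz/Mtransistor")
```

Because both the clock rate and the transistor count sit in the denominator, a processor can raise its absolute MFLOPS and still score lower on this metric if those two quantities grow faster than throughput.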
Looking Ahead
• For commodity clusters, should we be nervous?
– Significant increases in technology effort
– Diminishing returns
– Should we depend on CPU manufacturers for HPC?
• Sandia has many HPC interests
– Investigate computing alternatives and accelerators
– FPGAs: Modern Reconfigurable Computing
Outline
• Reconfigurable computing
– Use FPGAs to accelerate computations
• Strategy and examples
– Approaches to scientific computing
• Challenges for ultrascale science
– Double-precision floating-point performance
– System integration and network aspects
Reconfigurable Computing Background
“Soft Hardware”
Computing Spectrum
Software: General-Purpose CPU
• Easily reprogrammed
• Low cost
• Fundamental bottlenecks
[Diagram: CPU pipeline with Fetch, Decode, Registers, Execute, Memory, and Writeback stages]
Hardware: Application-Specific Integrated Circuit (ASIC)
• Not modifiable
• High cost
• Extremely fast
[Diagram: dataflow circuit combining inputs A, B, C, D, and π through +, x, xor, and z-1 elements to produce a result]
Soft-Hardware: Field Programmable Gate Arrays (FPGAs)
• Reconfigurable hardware
• Medium cost
• Speedup potential
Reconfigurable Hardware Devices
• Tile architecture
– Logic blocks (LBs)
– Routing elements
• Field-Programmable Gate Arrays
– Fine granularity
– LBs are bit-level operators
• Commercial trend
– Coarse granularity
– LBs are ALUs, FPUs
– QuickSilver, Pact XPP, ClearSpeed
[Diagram: grid of logic blocks (LBs) connected by routing elements]
Devices that can be programmed to emulate hardware circuitry
Common Acceleration Techniques
• Processing concurrency
• Hardware pipelines
• Custom memory interactions
• Partial evaluation
[Diagrams: external SRAM banks and internal SRAM; operands labeled A and B, with B annotated (0-15)]
Key: Designing in Hardware
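Partial evaluation, the last technique listed on this slide, means specializing a circuit for operands that are known in advance. The sketch below illustrates the idea in software, assuming one operand is a fixed constant and the other is a 4-bit value (0-15). The function names and the constant 13 are hypothetical, and a real design would express this in a hardware description language rather than Python.

```python
def specialize_multiplier(a: int, b_bits: int = 4) -> list[int]:
    """Precompute a * b for every b that fits in b_bits bits."""
    return [a * b for b in range(1 << b_bits)]


# Hypothetical constant operand: once A is fixed, the general multiplier
# collapses into a 16-entry table indexed by B.
table = specialize_multiplier(a=13)


def multiply_by_13(b: int) -> int:
    """Multiply by the fixed constant using only a table lookup."""
    return table[b]


print(multiply_by_13(7))  # 91
```

In an FPGA the same specialization tends to map onto small lookup tables, which is why fixing an operand ahead of time can save logic compared with a general-purpose multiplier.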
Reconfigurable Computing for Ultrascale Science:
HPC Strategy and Examples
Enhancing HPC Performance
HPC Strategy at Sandia for RC
• RC resources work best as accelerators in HPC
– Clusters are inexpensive & work well for many applications
– Add RC devices to enhance performance
• Port key portions of algorithms to RC hardware
– Focus on hotspots and inner loops
– Move data to/from FPGAs in pipelined fashion
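Why the strategy focuses on hotspots can be seen with Amdahl's law, which bounds the overall gain by the fraction of runtime that is actually offloaded. The sketch below is illustrative only: the function name, the offloaded fractions, and the 50x kernel speedup are assumptions, not figures from the talk.

```python
def overall_speedup(offloaded_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: speedup when only part of the runtime is accelerated."""
    if not 0.0 <= offloaded_fraction <= 1.0:
        raise ValueError("offloaded_fraction must be between 0 and 1")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    remaining = 1.0 - offloaded_fraction
    return 1.0 / (remaining + offloaded_fraction / kernel_speedup)


# Hypothetical scenarios: the same 50x kernel applied to different hotspots.
for fraction in (0.5, 0.9, 0.99):
    gain = overall_speedup(fraction, 50.0)
    print(f"{fraction:.0%} of runtime offloaded: {gain:.1f}x overall")
```

The model ignores data-movement time; pipelining transfers to and from the FPGA, as the slide suggests, is one way to keep that cost from eroding the gain.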
Scientific Computing Examples
• Pattern recognition
– ATLAS project at CERN
– Reduced 2500 CPUs to 120 nodes with FPGAs
• Visualization
– Vizard II project at University of Tübingen
– Direct volume rendering for 512^3 datasets
• Molecular dynamics (MD)
– Preliminary work at Los Alamos National Laboratory
– 20 cells in an FPGA yield 5.69 GFLOPS
• Computational fluid dynamics (CFD) analysis for jet engines
– Smith and Schnore at GE Global Research

Inner Loop Function   FLOPS   P4 1.8 GHz Host   Multi-FPGA System
Euler                 165     154 MFLOPS        10.2 GFLOPS
Viscous               619     77 MFLOPS         23.2 GFLOPS
Smoothing             249     86 MFLOPS         7.0 GFLOPS
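The CFD figures can be turned into speedup ratios directly. The short sketch below does that arithmetic using only the throughput values from the table, assuming 1 GFLOPS = 1000 MFLOPS; the variable and function names are illustrative.

```python
# (function, P4 1.8 GHz host MFLOPS, multi-FPGA system GFLOPS), from the table
results = [
    ("Euler", 154.0, 10.2),
    ("Viscous", 77.0, 23.2),
    ("Smoothing", 86.0, 7.0),
]


def speedup(host_mflops: float, fpga_gflops: float) -> float:
    """Ratio of FPGA throughput to host throughput, in matching units."""
    return fpga_gflops * 1000.0 / host_mflops


for name, host, fpga in results:
    print(f"{name}: {speedup(host, fpga):.0f}x")
```

This works out to roughly 66x for Euler, 301x for Viscous, and 81x for Smoothing, comparing a multi-FPGA system against a single host CPU.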
Challenges
• Hard to program
– Hardware design
– Must be significant parallelism
• Limited chip capacity
• Lack of HPC building blocks
– Our users need DP-FP
• System integration
– How do we add to our clusters?
[Slide annotations: Craig Ulmer (SNL/CA); Keith Underwood (SNL/NM); LANL, Academia; Industry]
Reconfigurable Computing for Ultrascale Science:
Double-Precision Floating-Point Cores
Addressing the need for HPC building blocks
Double-Precision Floating-Point Cores
• Floating point has been a historical weakness for FPGAs
– FP cores consume significant amounts of hardware
– Previous FPGAs lacked capacity
• Significant improvements in recent commercial FPGAs
– Increased capacity, faster clocks, and better building blocks
• Keith Underwood at SNL/NM
– Re-evaluating FP performance in FPGAs
– Constructing high-speed DP-FP cores
Peak Performance Results
                 Single Precision                                   Double Precision
Core             Speed     Cores per V2P100-6   Peak Performance    Speed     Cores per V2P100-6   Peak Performance
Addition         195 MHz   89                   17 GFLOPS           143 MHz   40                   5.7 GFLOPS
Multiplication   176 MHz   74                   13 GFLOPS           142 MHz   27                   3.8 GFLOPS
Division         120 MHz   22                   2.6 GFLOPS          98 MHz    6                    0.58 GFLOPS

From Underwood, “FPGAs vs. CPUs: Trends in Peak Floating-Point Performance,” in FPGA’04
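The peak figures in the table are consistent with a simple model: clock rate multiplied by the number of cores that fit on the device, assuming each core is fully pipelined and delivers one result per clock. That one-result-per-cycle assumption is an inference from the numbers, not a statement from the slide. A quick check:

```python
# (core, precision, clock in MHz, cores per V2P100-6, reported peak in GFLOPS)
rows = [
    ("Addition", "single", 195, 89, 17.0),
    ("Addition", "double", 143, 40, 5.7),
    ("Multiplication", "single", 176, 74, 13.0),
    ("Multiplication", "double", 142, 27, 3.8),
    ("Division", "single", 120, 22, 2.6),
    ("Division", "double", 98, 6, 0.58),
]


def peak_gflops(clock_mhz: float, cores: int) -> float:
    """Peak throughput assuming one result per core per clock cycle."""
    return clock_mhz * cores / 1000.0


for core, precision, mhz, cores, reported in rows:
    modeled = peak_gflops(mhz, cores)
    print(f"{core} ({precision}): modeled {modeled:.2f}, reported {reported} GFLOPS")
```

Each modeled value lands within a few percent of the reported one, with the difference explained by rounding in the table.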
Double-Precision Multiply Performance Trends
Reconfigurable Computing for Ultrascale Science:
Networking Aspects
Addressing capacity and system integration issues
Data Exchange: Multi-Gigabit Transceivers (MGTs)
• How do we rapidly move data into/out of FPGA?
• Xilinx Virtex-II/Pro FPGA has MGTs
– Channel data rates: 3.125 Gbps
– Up to 24 channels
– V2/ProX: twenty 10 Gbps channels
• Configured for different physical layers
– InfiniBand, FC, GigE, 10GigE
– S-ATA, PCI-Express, HT
[Diagram: FPGA fabric connected to Rocket I/O MGT blocks, each attached to a pair of pins]
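The channel counts and rates above imply an aggregate line rate per device. The sketch below does that arithmetic; the function name is illustrative, and the 0.8 coding efficiency is an assumption (8b/10b line coding carries 8 payload bits per 10 line bits, but whether a given link uses it depends on the configured protocol).

```python
def aggregate_gbps(channels: int, rate_gbps: float,
                   coding_efficiency: float = 1.0) -> float:
    """Total bandwidth across all channels, optionally scaled for line coding."""
    return channels * rate_gbps * coding_efficiency


v2pro_raw = aggregate_gbps(24, 3.125)   # 24 channels at 3.125 Gbps
v2prox_raw = aggregate_gbps(20, 10.0)   # twenty 10 Gbps channels

# Assumption: 8b/10b coding, so only 80% of the line rate is payload.
v2pro_payload = aggregate_gbps(24, 3.125, coding_efficiency=0.8)

print(f"Virtex-II/Pro raw:  {v2pro_raw:.1f} Gbps")
print(f"V2/ProX raw:        {v2prox_raw:.1f} Gbps")
print(f"Virtex-II/Pro payload (8b/10b assumed): {v2pro_payload:.1f} Gbps")
```

The raw totals come to 75 Gbps and 200 Gbps per direction, which is the sense in which MGTs provide a fast path into and out of the FPGA.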
Importance of MGTs
Increase Raw Capacity
• Connect FPGAs together
– MGTs provide fat pipes
– Cables, not PCB traces
[Diagram: four FPGAs, each containing computational circuits, linked to one another by channels]
System Integration
• Connect FPGA to SAN
– Implement NI in FPGA
– FPGA is global resource
[Diagram: an FPGA with computational circuits and NI Tx/Rx blocks attached to a System Area Network alongside CPUs with NICs]
Recent Sandia Work: SNL OpenTOE
• At Sandia we are interested in connecting FPGAs to SANs
– Main target: InfiniBand
– Must implement network protocols for reliable transfer
• Initial work: GigE and TCP
– Implemented GigE core and basic TCP offload engine
[Diagram: FPGA containing the SNL OpenTOE NI (MGT Tx/Rx, GigE IP core, TCP core) connected to the computational circuits]
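To give a sense of the per-segment work a TCP offload engine takes over from the host, the sketch below computes the standard Internet checksum (RFC 1071) that TCP uses. It is a generic software illustration, not the SNL OpenTOE implementation; the function name is hypothetical, and the TCP pseudo-header that a real checksum also covers is omitted.

```python
def internet_checksum(data: bytes) -> int:
    """One's-complement checksum over 16-bit big-endian words (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


# Worked example bytes from RFC 1071.
print(hex(internet_checksum(bytes.fromhex("0001f203f4f5f6f7"))))  # 0x220d
```

The loop is a chain of 16-bit additions with end-around carry, the kind of regular, streaming arithmetic that maps naturally onto a hardware pipeline.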
Concluding Remarks
• Improvements in commercial FPGAs make RC attractive
– FPGAs provide better sustained performance than CPUs
– FPGA performance growing faster than Moore’s Law
• Near-term strategy: accelerator-based approach
– Offload key operations into hardware
• Sandia National Labs investigating RC for HPC acceleration
– Enabling scientific computing through fast DP FP cores
– Addressing system integration/capacity issues via network