On the Road to Exascale: Lessons from Contemporary
Scalable GPU Systems
Jeffrey Vetter
http://ft.ornl.gov [email protected]
Presented to ATIP/A*STAR/NSF Workshop on Accelerator Technologies
Singapore 8 May 2012
Highlights
Today’s HPC landscape is growing more diverse
Projections for Exascale architectures are extreme
– Scalability to 1B threads
– Move to diverse, heterogeneous architectures
– Limited memory capacity and bandwidth
Emerging technologies
– GPUs
– NVRAM technologies may offer a solution
Lessons from today’s emerging technologies for Exascale
– Software infrastructure plays a pivotal role
Contemporary Architectures

Date | System | Location | Comp | Comm | Peak (PF) | Power (MW)
2009 | Jaguar; Cray XT5 | ORNL | AMD 6c | Seastar2 | 2.3 | 7.0
2010 | Tianhe-1A | NSC Tianjin | Intel + NVIDIA | Proprietary | 4.7 | 4.0
2010 | Nebulae | NSCS Shenzhen | Intel + NVIDIA | IB | 2.9 | 2.6
2010 | Tsubame 2 | TiTech | Intel + NVIDIA | IB | 2.4 | 1.4
2011 | K Computer | RIKEN/Kobe | SPARC64 VIIIfx | Tofu | 10.5 | 12.7
2011 | Sunway BlueLight | NSC Jinan | Shenwei SW-1600 | IB | 1 | 1
2012 | Titan; Cray XK6 | ORNL | AMD + NVIDIA | Gemini | 10-20 | 7?
2012 | Mira; BlueGene/Q | ANL | SoC | Proprietary | 10 | 4-5?
2012 | Sequoia; BlueGene/Q | LLNL | SoC | Proprietary | 20 | 8-10?
2012 | Blue Waters; Cray | NCSA/UIUC | AMD + NVIDIA (partial) | Gemini | 10? | ?
2013 | Stampede | TACC | Intel + MIC | IB | 10? | 10?
K: #1 in November 2011
10.5 PF (93% of peak) – 12.7 MW
705,024 cores – 1.4 PB memory
Source: RIKEN, Fujitsu
Notional Exascale Architecture Targets (from Exascale Arch Report 2009)

System attributes | 2001 | 2010 | "2015" | "2018"
System peak | 10 Tera | 2 Peta | 200 Petaflop/sec | 1 Exaflop/sec
Power | ~0.8 MW | 6 MW | 15 MW | 20 MW
System memory | 0.006 PB | 0.3 PB | 5 PB | 32-64 PB
Node performance | 0.024 TF | 0.125 TF | 0.5 TF or 7 TF | 1 TF or 10 TF
Node memory BW | 25 GB/s | 0.1 TB/sec | 1 TB/sec or 0.4 TB/sec | 4 TB/sec
Node concurrency | 16 | 12 | O(100) or O(1,000) | O(1,000) or O(10,000)
System size (nodes) | 416 | 18,700 | 50,000 or 5,000 | 1,000,000 or 100,000
Total node interconnect BW | 1.5 GB/s | 150 GB/sec | 1 TB/sec or 250 GB/sec | 2 TB/sec
MTTI | – | day | O(1 day) | O(1 day)

The "2015" and "2018" columns each carry two design points (many thin nodes vs. fewer fat nodes); either way, roughly 100,000-1,000,000 nodes times O(1,000)-O(10,000) threads per node is the source of the billion-thread scalability projection in the highlights.
http://science.energy.gov/ascr/news-and-resources/workshops-and-conferences/grand-challenges/
NVIDIA Echelon System Sketch
DARPA Echelon team: NVIDIA, ORNL, Micron, Cray, Georgia Tech, Stanford, UC-Berkeley, U Penn, Utah, Tennessee, Lockheed Martin
Keeneland Initial Delivery system procured and installed in Oct 2010
201 TFLOPS in 7 racks (90 sq ft incl service area)
902 MFLOPS per watt on HPL (#12 on Green500)
Upgraded April 2012 to 255 TFLOPS
Final delivery system of over 500 TFLOPS coming this summer.

[System diagram: Keeneland system hierarchy]
– Xeon 5660 CPU: 67 GFLOPS; NVIDIA M2070 GPU: 515 GFLOPS
– ProLiant SL390s G7 node (2 CPUs, 3 GPUs): 1679 GFLOPS, 24/18 GB
– S6500 chassis (4 nodes): 6718 GFLOPS
– Rack (6 chassis): 40306 GFLOPS
– Keeneland system (7 racks): 201528 GFLOPS
– 12000-Series Director Switch; full PCIe x16 bandwidth to all GPUs; integrated with NICS datacenter GPFS and TeraGrid
http://keeneland.gatech.edu
J.S. Vetter, R. Glassbrook et al., “Keeneland: Bringing heterogeneous GPU computing to the computational science community,” IEEE Computing in Science and Engineering, 13(5):90-5, 2011, http://dx.doi.org/10.1109/MCSE.2011.83.
Keeneland Full-Scale System
500+ TF full-scale system being built now – installation scheduled for June
Configuration
– 700+ NVIDIA M2090 GPUs
– HP ProLiant SL250 G8 nodes
• 2 Sandy Bridge processors
– Mellanox IB FDR
– Petabyte+ Lustre filesystem
NSF XSEDE production resource as of Oct 1
Today's architectures force applications to use very complex programming models:
– MPI: low overhead; resource contention; locality
– OpenMP, Pthreads: SIMD; NUMA
– OpenACC, CUDA, OpenCL: memory use and coalescing; data orchestration; fine-grained parallelism; hardware features
A sketch of how these layers stack in a single application appears below.
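The skeleton below is my illustration (not code from the talk) of the three-layer stack: MPI between nodes, OpenMP across host cores, and CUDA on the accelerator, with explicit data staging over PCIe. All names and sizes are arbitrary.

// Hypothetical MPI + OpenMP + CUDA skeleton; error handling omitted.
#include <mpi.h>
#include <omp.h>
#include <cuda_runtime.h>

__global__ void update(float* u, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) u[i] *= 0.5f;          // fine-grained data parallelism
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);           // distributed memory across nodes
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;
    float *host, *dev;
    cudaMallocHost((void**)&host, n * sizeof(float));  // pinned for fast PCIe copies
    cudaMalloc((void**)&dev, n * sizeof(float));

    #pragma omp parallel for          // shared memory across host cores
    for (int i = 0; i < n; i++) host[i] = (float)rank;

    // Data orchestration: stage to the GPU, compute, stage back.
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    update<<<(n + 255) / 256, 256>>>(dev, n);
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Locality: neighbor exchange of one boundary value between ranks.
    if (size > 1) {
        float halo;
        MPI_Sendrecv(host, 1, MPI_FLOAT, (rank + 1) % size, 0,
                     &halo, 1, MPI_FLOAT, (rank + size - 1) % size, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(dev); cudaFreeHost(host);
    MPI_Finalize();
    return 0;
}

Even this toy program needs three models at once, which is the complexity the slide is pointing at.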
E.M. Aldrich, J. Fernández-Villaverde et al., "Tapping the supercomputer under your desk: Solving dynamic equilibrium models with graphics processors," Journal of Economic Dynamics and Control, 35(3):386-93, 2011.
Observations on GPU HPC Challenges
Lack of hardware features – double precision arithmetic, ECC, etc.
Scalability continues unabated.
Programming complexity
– Reasonable abstractions to program architectures
(Performance) Portability
– Functional portability is mandatory
• MPI and OpenMP are functional specifications
– Performance portability is becoming critically important
• Though, well-funded experts can always optimize their applications
– Even GPU systems are usually different
• Node architecture, host processors, interconnects, memory
Crossing the Chasm, Geoffrey A. Moore
[Figure: technology adoption life cycle; vertical axis: relative % of customers]
How to make technology more accessible?
Broadening Application Base
Application Drivers for Exascale
Town Hall Meetings, April-June 2007
Scientific Grand Challenges Workshops, Nov 2008 – Oct 2009
– Climate Science (11/08)
– High Energy Physics (12/08)
– Nuclear Physics (1/09)
– Fusion Energy (3/09)
– Nuclear Energy (5/09)
– Biology (8/09)
– Material Science and Chemistry (8/09)
– National Security (10/09)
– Cross-cutting technologies (2/10)
Exascale Steering Committee
– Extreme Architecture and Technology Workshop, 12/2009
International Exascale Software Project
MISSION IMPERATIVES
FUNDAMENTAL SCIENCE
http://science.energy.gov/ascr/news-and-resources/workshops-and-conferences/grand-challenges/ http://www.exascale.org/iesp/Main_Page
Codesign View of HPC
Applications
• Materials
• Climate
• Fusion
• National Security
• Combustion
• Nuclear Energy
• Cybersecurity
• Biology
• High Energy Physics
• Energy Storage
• Photovoltaics
• National Competitiveness
• Usage Scenarios
• Ensembles
• UQ
• Visualization
• Analytics
Programming Environment
• Domain specific
• Libraries
• Frameworks
• Templates
• Domain specific languages
• Patterns
• Autotuners
• Platform specific
• Languages
• Compilers
• Interpreters/Scripting
• Performance and Correctness Tools
• Source code control
System Software
• Resource Allocation
• Scheduling
• Security
• Communication
• Synchronization
• Filesystems
• Instrumentation
• Virtualization
Architectures
• Processors
• Multicore
• Graphics Processors
• Vector processors
• FPGA
• DSP
• Memory and Storage
• Shared (cc, scratchpad)
• Distributed
• RAM
• Storage Class Memory
• Disk
• Archival
• Interconnects
• InfiniBand
• IBM Torrent
• Cray Gemini, Aries
• BGL/P/Q
• 1/10/100 GigE
Performance, Resilience, Power, Programmability
Impact and Champions / Novel Ideas / Milestones, Dates, Status
Principal Investigator: Robert Rosner, ANL – March 1, 2012

Milestone | Scheduled | Actual
Kernels, initial codes in repository | 1/12 | 12/11
Formulation of 1st-year calculation | 1/12 | 1/12
NEK data structures in MOAB | 1/12 | 1/12
Initial performance model for NEK | 7/12 | –
Initial performance analysis for UNIC | 7/12 | –
Initial uncertainty quant. runs | 7/12 | –
Complete pin bundle calculations | 10/12 | –
Custom viz design for NEK/UNIC output | 12/12 | –
• Develop innovative, scalable algorithms for neutronics and thermo-hydraulics computations suitable for exascale computers
• Couple high-fidelity thermo-hydraulics and neutronics codes for challenging multi-scale, multi-physics computations
• Drive design decisions for next-generation programming models and computer architectures at the exascale
IMD CESAR – Center for Exascale Simulation of Advanced Reactors
Simulating a complete nuclear power system in fine detail will fundamentally change the paradigm of how advanced nuclear reactors are designed, built, tested, and operated.
• Every step of the nuclear regulatory timeline can be compressed by guiding expensive experiment efforts.
• New designs can be rapidly prototyped, accident scenarios can be studied in detail, material properties can be discovered, and design margins can be dramatically improved.
• Scientists can analyze problems for a wide range of novel reactor systems.
Impact and Champions / Novel Ideas / Milestones, Dates, Status
30 Mar 2012

Milestone | Scheduled | Actual
Kickoff Workshop | AUG 2011 | AUG 2011
Initial molecular dynamics (MD) SPMD proxy app(s) | DEC 2011 | DEC 2011
Initial scale-bridging MPMD proxy app(s) | MAY 2012 | –
Prototype MD DSL | SEP 2012 | –
Assessment of data/resource sharing requirements, both for scale-bridging and in situ visualization/analysis | 2013 | –
Demonstrate scale-bridging on 10+ PF-class platform | 2015 | –
IMPACT. Our goal is to establish the interrelationship between hardware, middleware (software stack), programming models, and algorithms required to enable a productive exascale environment for multiphysics simulations of materials in extreme mechanical and radiation environments.
The design and development of extreme-environment-tolerant advanced materials by manipulating microstructure and interfaces, at the grain scale, depends on such predictive capabilities.
Embedded scale-bridging materials science
– Adaptive physics refinement
– Asynchronous task-based approach
Agile development of proxy application suite
– Single-scale apps target node-level issues
– Scale-bridging apps target system-level issues
Co-optimization for P3R: Price, Performance, Power, and Resiliency
– ASPEN, SST models & simulators
– GREMLIN emulator for stress-testing
IMD Exascale Co-Design Center for Materials in Extreme Environments
Director: Tim Germann (LANL)
Deputy Director: Jim Belak (LLNL)
The Scalable Heterogeneous Computing Benchmark Suite (SHOC)
Targets fast-moving hardware and software architectures
Focus on scientific computing workloads
– FFT, MD, QTC, S3D, Stencil, BFS
Parallelized with MPI, with support for multi-GPU and cluster-scale comparisons
Implemented in both CUDA and OpenCL for a 1:1 comparison
Includes system stability tests
SHOC results data maintained on Google Public Data Explorer
A. Danalis, G. Marin, C. McCurdy, J. Meredith, P.C. Roth, K. Spafford, V. Tipparaju, and J.S. Vetter, "The Scalable HeterOgeneous Computing (SHOC) Benchmark Suite," in Third Workshop on General-Purpose Computation on Graphics Processors (GPGPU 2010). Pittsburgh, 2010.
SHOC available at http://bit.ly/shocmarx (a microbenchmark-style sketch follows)
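To give a flavor of what SHOC's low-level tests measure, here is a minimal sketch in the spirit of its bspeed_readback benchmark (my illustration, not SHOC source): time a pinned device-to-host copy and report the PCIe readback bandwidth.

// Illustrative SHOC-style microbenchmark: device-to-host copy bandwidth.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64 << 20;             // 64 MB transfer
    float *host, *dev;
    cudaMallocHost((void**)&host, bytes);      // pinned host buffer
    cudaMalloc((void**)&dev, bytes);
    cudaMemset(dev, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);    // elapsed time in milliseconds
    printf("readback: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaFree(dev);
    cudaFreeHost(host);
    return 0;
}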
Example: Molecular Dynamics
Motivation
– Classic n-body pairwise computation, important to all MD codes such as LAMMPS, AMBER, NAMD, GROMACS, CHARMM
Basic design
– Computation of the Lennard-Jones potential
– 3D domain, random distribution
– Neighbor-list algorithm (a kernel sketch follows)
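The sketch below is my reconstruction of the general pattern (one thread per atom, flat neighbor-list array), not SHOC's actual source; lj_kernel and its parameters are hypothetical, with lj1/lj2 folding the usual epsilon/sigma constants of the pair potential.

// Hypothetical Lennard-Jones force kernel with a neighbor list:
// one thread per atom; neighbors stored in a flat [nAtoms x maxNeigh] array.
__global__ void lj_kernel(const float4* pos, float4* force,
                          const int* neigh, const int* nNeigh,
                          int nAtoms, int maxNeigh,
                          float cutsq, float lj1, float lj2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nAtoms) return;

    float4 p = pos[i];
    float fx = 0.f, fy = 0.f, fz = 0.f;

    for (int j = 0; j < nNeigh[i]; j++) {
        int k = neigh[i * maxNeigh + j];         // j-th neighbor of atom i
        float4 q = pos[k];
        float dx = p.x - q.x, dy = p.y - q.y, dz = p.z - q.z;
        float r2 = dx * dx + dy * dy + dz * dz;
        if (r2 < cutsq) {                        // within the cutoff radius
            float ir2 = 1.f / r2;
            float ir6 = ir2 * ir2 * ir2;
            float f = ir6 * (lj1 * ir6 - lj2) * ir2;  // LJ force magnitude / r
            fx += dx * f; fy += dy * f; fz += dz * f;
        }
    }
    force[i] = make_float4(fx, fy, fz, 0.f);
}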
Example: S3D
Motivation
– Measure performance of an important DOE application
– S3D solves the Navier-Stokes equations on a regular 3D domain; used to simulate combustion
– Benchmark is the ethylene-air getrates kernel
Basic design
– Assign each grid point to a device thread
– Highly parallel, as grid points are independent
OpenCL/CUDA observations
– CUDA performance dominates OpenCL
• Big factor: native transcendentals (sin, cos, tan, etc.)
3D regular domain decomposition – each thread handles a grid point, blocks handle regions (see the sketch below)
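A minimal sketch of that thread mapping, assuming a stand-in transcendental-heavy rate() function rather than S3D's actual getrates chemistry:

// One thread per grid point on a regular 3D domain; blocks tile regions.
__device__ float rate(float T) {
    // Placeholder pointwise kernel; CUDA can use hardware intrinsics such
    // as __expf() where precision allows, part of the gap vs. OpenCL here.
    return __expf(-1.0f / T);
}

__global__ void pointwise3d(const float* temp, float* out,
                            int nx, int ny, int nz)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= nx || y >= ny || z >= nz) return;

    int idx = (z * ny + y) * nx + x;   // linearized 3D index
    out[idx] = rate(temp[idx]);        // grid points are independent
}

// Launch example: dim3 block(8, 8, 8);
//                 dim3 grid((nx+7)/8, (ny+7)/8, (nz+7)/8);
//                 pointwise3d<<<grid, block>>>(temp, out, nx, ny, nz);

Note that 3D grid launches (blockIdx.z) require compute capability 2.0 and CUDA 4.0, which matches the Fermi M2070 and CUDA 4.0 setup used for the results below.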
AMD's Llano: A-Series APU
Combines
– 4 x86 cores
– Array of Radeon cores
– Multimedia accelerators
– Dual-channel DDR3
32 nm process
Up to 29 GB/s memory bandwidth
Up to 500 GFLOPS SP
45 W TDP
Source: AMD
Tight integrations: Evaluation of Llano v. Discrete GPUs
To appear: K. Spafford, J.S. Meredith, S. Lee, D. Li, P.C. Roth, and J.S. Vetter, “The Tradeoffs of Fused Memory Hierarchies in Heterogeneous Architectures,” in ACM Computing Frontiers (CF). Cagliari, Italy: ACM, 2012. Note: Both SB and Llano are consumer parts, not server parts.
CUDA and OpenCL: Today's Performance?
This chart shows the speedup of CUDA over OpenCL on a single Tesla M2070 on KIDS (CUDA 4.0).
[Figure: bar chart of per-benchmark speedups; the full data appear on the next slide.]
A Second Look
Speedup of CUDA over OpenCL, per SHOC benchmark:

maxspflops 1.00, maxdpflops 1.00, gmem_readbw 0.99, gmem_writebw 1.01, lmem_readbw 1.00, lmem_writebw 0.94
tex_readbw 1.00, bspeed_download 1.00, bspeed_readback 0.99, fft_sp 5.21, fft_dp 6.15, sgemm_n 1.49
dgemm_n 1.77, md_sp_bw 1.18, md_dp_bw 1.67, reduction 1.01, reduction_pcie 1.00, reduction_dp 1.01
scan 1.08, scan_dp 1.06, spmv_csr_scalar_sp 1.00, spmv_csr_vector_sp 2.17, spmv_ellpackr_sp 1.08, spmv_csr_scalar_dp 1.01
spmv_csr_vector_dp 2.41, spmv_ellpackr_dp 1.53, stencil 0.93, stencil_dp 0.96, s3d 0.94, s3d_dp 1.11

Chart callouts flag the outliers as "reliant on texture memory" and "dependent on transcendentals".
But this wasn't always the case…
[Figure: OpenCL vs. CUDA on a Tesla C1060 GPU, annotated v2.3 (Sept 2009) and v3.0 (Feb 2010)]
FFT – OpenCL 27.4, CUDA 165.9
MD – OpenCL 205.2, CUDA 292.9
SGEMM – OpenCL 300.3, CUDA 373.1
Reduction – OpenCL 54.8, CUDA 72.1
Scan – OpenCL 10.4, CUDA 16.4
Who is using SHOC?
Thousands of hits per month, thousands of downloads
Procurements and acceptance tests
– Procurement and acceptance on the NSF Keeneland System
– Various government agencies
AMD, ARM, IBM, NVIDIA, Intel, etc.
– nvcc compiler regression suite
– OpenCL compiler regression suite
Scientists are using SHOC for research (55 citations to date)
Try SHOC at http://bit.ly/shocmarx
Directive-based Compilation and Tuning
PGI Accelerator, HMPP (CAPS Enterprise), Cray, R-Stream
OpenACC (http://www.openacc-standard.org/)
OpenMPC (OpenMP extended for CUDA)
– OpenMPC = OpenMP + a new set of directives and environment variables for CUDA
– OpenMPC provides
• A high-level abstraction of the CUDA programming model (programmability)
• An easy tuning environment to generate CUDA programs in many optimization variants (tunability)
A minimal directive example follows the citation below.
Seyong Lee et al., "OpenMPC: Extended OpenMP Programming and Tuning for GPUs," SC10: Proceedings of the 2010 ACM/IEEE Conference on Supercomputing (Best Student Paper Award), November 2010.
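To make the directive style concrete, here is a minimal OpenACC sketch (my illustration, not code from the talk or from OpenMPC); the pragma and clauses are standard OpenACC, everything else is arbitrary.

/* Offload a SAXPY loop with OpenACC; the compiler generates the GPU
   kernel and thread mapping from the directive. Build with an OpenACC
   compiler, e.g. pgcc -acc saxpy.c */
#include <stdio.h>

int main(void) {
    const int n = 1 << 20;
    static float x[1 << 20], y[1 << 20];
    const float a = 2.0f;

    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* copyin/copy clauses make the CPU-GPU data movement explicit */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);   /* expect 4.0 */
    return 0;
}

One directive replaces the device allocation, transfers, kernel, and launch code of the hand-written CUDA version, which is exactly the programmability/tunability trade the slide describes.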
Existing GPU Directive Landscape

Feature | OpenMPC | hiCUDA | PGI | HMPP | R-Stream | OpenACC
Code regions to be offloaded | General structured blocks | General structured block | Loops | Loops | Loops | Loops
Loop mapping | Parallel | Parallel | Parallel/Vector | Parallel | Parallel | Parallel/Vector
Data management:
GPU memory allocation and free | Explicit/Implicit | Explicit | Explicit/Implicit | Explicit/Implicit | Implicit | Explicit/Implicit
Data movement between CPU and GPU | Explicit/Implicit | Explicit | Explicit/Implicit | Explicit/Implicit | Implicit | Explicit/Implicit
Compiler optimizations:
Loop transformations | Explicit | Implicit | Explicit | Implicit | Implicit | Imp-dep
Data management optimizations | Explicit/Implicit | Implicit | Explicit/Implicit | Explicit/Implicit | Implicit | Imp-dep
GPU-specific features:
Thread batching | Explicit/Implicit | Explicit | Indirect/Implicit | Explicit/Implicit | Explicit/Implicit | Indirect/Implicit
Utilization of special memories | Explicit/Implicit | Explicit | Indirect/Implicit | Explicit | Implicit | Indirect/Imp-dep
Performance of High-Level GPU Programming Models (Anonymized)
• Speedups are over serial on the CPU, compiled with GCC v4.1.2 using option -O3, with the largest available input data.
• Experimental platform: CPU: Intel Xeon E5520 at 2.27 GHz; GPU: NVIDIA Tesla C2050 with 448 CUDA cores at 1.15 GHz.
[Figure: per-benchmark speedups for three anonymized models on a log scale (0.1-1000x); peak values of 161x, 397x, 588x, and 532x appear.]
Hand-Written CUDA Performance Variation by Tuning
[Figure: Stencil2D parallel profile/trace]
Tools: Ocelot, MAGMA, HyVM*, TAU
Courtesy Allen Malony, U. Oregon; DOE Vancouver Project: https://ft.ornl.gov/trac/vancouver
Integrated GPU and CPU Performance Counters
Other software activities
– Maestro: autotuning for OpenCL
– GA-GPU: Global Arrays over GPUs
– Extending VisIt for in-situ visualization
Blackcomb: Hardware-Software Co-design for Non-Volatile Memory in Exascale Systems
Approach
Rearchitect servers and clusters, using nonvolatile memory (NVM) to overcome resilience, energy, and performance walls in exascale computing:
– Ultrafast checkpointing to nearby NVM (see the sketch after this list)
– Reoptimize the memory hierarchy for exascale, using new memory technologies
– Replace disk with fast, low-power NVM
– Low-power cores inside the data store
– Co-design using proxy applications from the co-design centers
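As an illustration of what "checkpointing to nearby NVM" could look like from software, here is a hedged C sketch assuming byte-addressable NVM exposed as a memory-mapped file; the path and the checkpoint() helper are hypothetical, not Blackcomb code.

/* Hypothetical checkpoint of application state to a memory-mapped NVM
   region; /mnt/nvm/ckpt is an assumed mount point, not a real system path. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int checkpoint(const void* state, size_t len) {
    int fd = open("/mnt/nvm/ckpt", O_RDWR | O_CREAT, 0600);
    if (fd < 0) return -1;
    if (ftruncate(fd, (off_t)len) != 0) { close(fd); return -1; }

    void* dst = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (dst == MAP_FAILED) { close(fd); return -1; }

    memcpy(dst, state, len);     /* copy state into the NVM-backed region */
    msync(dst, len, MS_SYNC);    /* force the data to persistence */

    munmap(dst, len);
    close(fd);
    return 0;
}

With NVM adjacent to the node, a checkpoint becomes a local memory copy rather than a trip through the parallel filesystem, which is the latency argument behind "ultrafast checkpointing".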
Jeffrey Vetter, ORNL Robert Schreiber, HP Labs
Trevor Mudge, University of Michigan Yuan Xie, Penn State University
Impact (FWP #ERKJU59)
– Provide the memory capacity needed for exascale with outstanding power, cost, and bandwidth
– Address energy scalability of future exascale systems; NV memories have zero standby power
– Increase system reliability; MRAM/PCRAM are resilient to soft errors
– Create new programmer's interfaces to NVM
Objectives
– Identify and evaluate the most promising NVM technologies – STT, PCRAM, memristor
– Explore assembly of NVM and CMOS into a storage + memory stack
– Propose an exascale HPC system architecture that builds on our new memory architecture
– New resilience strategies in software
– Test and simulate, driven by proxy applications
A comparison of various memory technologies:

Attribute | SRAM | DRAM | NAND Flash | PC-RAM | STT-RAM | R-RAM (Xpoint)
Data retention | N | N | Y | Y | Y | Y
Memory cell factor (F^2) | 50-200 | 6 | 2-5 | 4-10 | 6-20 | ≤4
Read time (ns) | <1 | 30 | 10^4 | 10-100 | 10-40 | 5-30
Write/erase time (ns) | <1 | 50 | 10^5 | 100-300 | 10-40 | 5-100
Number of rewrites | 10^16 | 10^16 | 10^4-10^5 | 10^8-10^15 | 10^15 | 10^6-10^12
Power read/write | Low | Low | High | Low | Low | Low
Power (other than R/W) | Leakage current | Refresh power | None | None | None | None
Formal Problem Statement / Challenges
– Understand and mitigate limitations of NVM as memory: higher write latency and lower endurance than SRAM/DRAM
– Analytical/simulation hybrid model to understand trade-offs between energy efficiency, resilience, and performance
– Evaluate productivity of programming models that exploit NVM to guarantee correctness in the face of hw/sw failures
• Proposes a new distributed architecture in which:
1. Mechanical-disk-based data stores are completely replaced with energy-efficient non-volatile memories;
2. Compute capacity, comprised of balanced low-power simple cores, is co-located with the data store;
3. Most levels of the hierarchy, including DRAM and last levels of SRAM cache, are completely eliminated.
• Addresses device scalability and energy efficiency of charge-based memories, eliminates the problem of increasing DRAM soft-error rates, and revolutionizes storage.
Research Products/Artifacts
– Co-design using proxy applications
– Understand likely failure scenarios and probabilities at exascale
– Simulators, simulations, and tests
– New system architecture at the node level
– Programming-layer software
– Analyses of NVM technologies; improved low-level architectures that use them
[Figure: potential NV-RAM DIMM layout]
John Smith, ORNL Jane Doe, University of Tennessee
Jeffrey Vetter, ORNL Robert Schreiber, HP Labs
Trevor Mudge, University of Michigan Yuan Xie, Penn State University
Blackcomb: Hardware-Software Co-design for Non-Volatile Memory in Exascale Systems
Summary
Today's HPC landscape is growing more diverse.
Projections for Exascale architectures are extreme
– Scalability to 1B threads
– Move to diverse, heterogeneous architectures
– Limited memory capacity and bandwidth
Emerging technologies
– GPUs
– NVRAM technologies may offer a solution
Lessons
– Performance bounds: SHOC benchmarks
– Software infrastructure plays a pivotal role in Exascale success
• Directive-based compilation, tools, GA-GPU
Contributors and Sponsors
Future Technologies Group: http://ft.ornl.gov
US National Science Foundation Keeneland Project: http://keeneland.gatech.edu
US Department of Energy Office of Science
– DOE Vancouver Project: https://ft.ornl.gov/trac/vancouver
– DOE Blackcomb Project: https://ft.ornl.gov/trac/blackcomb
– DOE ExMatEx Codesign Center: http://codesign.lanl.gov
– DOE Cesar Codesign Center: http://cesar.mcs.anl.gov/
– DOE Exascale Efforts: http://science.energy.gov/ascr/research/computer-science/
Scalable Heterogeneous Computing Benchmark team: http://bit.ly/shocmarx
US DARPA NVIDIA Echelon
NVIDIA CUDA Center of Excellence
International Exascale Software Project: http://www.exascale.org/iesp/Main_Page