© NVIDIA Corporation 2011
Supercomputing with NVIDIA GPUs
HPCN Workshop, May 2011
Axel Koehler, NVIDIA
NVIDIA Introduction and HPC Evolution of GPUs
Public company based in Santa Clara, CA | ~$4B revenue | ~6000 employees
Founded in 1993, with its primary business in the semiconductor industry
Products for graphics in workstations, notebooks, mobile devices, etc.
Began R&D of GPUs for HPC in 2004; released the first Tesla products and CUDA in 2007
Development of the GPU as a co-processing accelerator for x86 CPUs
HPC Evolution of GPUs: 3 Generations of Tesla in 3 Years
2004: Began strategic investments in the GPU as an HPC co-processor
2006: G80, the first GPU with built-in compute features, 128 cores; CUDA SDK beta
2007: Tesla 8-series based on G80, 128 cores; CUDA 1.0, 1.1
2008: Tesla 10-series based on GT200, 240 cores; CUDA 2.0, 2.3
2009: Tesla 20-series, code-named "Fermi", up to 512 cores; CUDA SDK 3.0
Tesla GPUs Power 3 of Top 5 Supercomputers
#1 Tianhe-1A: 7168 Tesla GPUs, 2.5 PFLOPS
#3 Nebulae: 4650 Tesla GPUs, 1.2 PFLOPS
#4 Tsubame 2.0: 4224 Tesla GPUs, 1.194 PFLOPS
"We not only created the world's fastest computer, but also implemented a heterogeneous computing architecture incorporating CPU and GPU; this is a new innovation." (Premier Wen Jiabao, public comments acknowledging Tianhe-1A)
3 of Top 5 Supercomputers
[Chart: power consumption (megawatts, 0–8) and performance (gigaflops, 0–3000) for Tianhe-1A, Jaguar, Nebulae, Tsubame, Hopper II, and Tera 100]
GPU Computing Today, By the Numbers:
200+ million CUDA-capable GPUs
600,000+ CUDA Toolkit downloads
100,000+ active GPU computing developers
8,000 members in the Parallel Nsight developer program
362 universities teaching CUDA worldwide
11 CUDA Centers of Excellence worldwide
Wide Adoption of Tesla GPUs
Oil and Gas: Reverse Time Migration, Kirchhoff Time Migration, Reservoir Simulation
Edu/Research: Astrophysics, Molecular Dynamics, Weather/Climate Modeling
Government: Signal Processing, Satellite Imaging, Video Analytics, Synthetic Aperture Radar
Life Sciences: Bio-chemistry, Bio-informatics, Material Science, Sequence Analysis, Genomics
Finance: Risk Analytics, Monte Carlo, Options Pricing, Insurance Modeling
Manufacturing: Structural Mechanics, Computational Fluid Dynamics, Machine Vision, Electromagnetics
MATLAB Makes GPUs More Accessible
Bridges the spectrum from the scientist/practitioner (domain expertise) to the developer/computer scientist (computational expertise)
MATLAB benefits:
• Faster time to discovery
• Empowers the scientist/practitioner
• No need for programming expertise
• No custom tools
• Automated application deployment
Language integration: CUDA C/C++ and high-level technical computing languages
1 million+ MATLAB licensees
GPU Progress – CAE ISV Software
GPU status categories (each product was mapped to one on the original slide): Available Today | Product in 2011 | Product Evaluation | Research Evaluation
Structural Mechanics: ANSYS Mechanical, AFEA, Abaqus/Standard, LS-DYNA implicit, Marc, MD Nastran, RADIOSS implicit, PAM-CRASH implicit, NX Nastran, RecurDyn, LS-DYNA, Abaqus/Explicit, RADIOSS, PAM-CRASH
Fluid Dynamics: AcuSolve, Moldflow, Culises (OpenFOAM), Particleworks, CFD-ACE+, Abaqus/CFD, FloEFD, STAR-CCM+, ANSYS CFD (FLUENT+CFX), CFD++, LS-DYNA CFD
Electromagnetics: Nexxim, EMPro, CST MS, XFdtd, SEMCAD X, Xpatch, HFSS, Maxwell
The "Fermi" Architecture: The Soul of a Supercomputer in the Body of a GPU
3 billion transistors
Over 2× the cores (512 total)
8× the peak double-precision performance
ECC
L1 and L2 caches
~2× memory bandwidth (GDDR5)
Up to 1 Terabyte of GPU memory
Concurrent kernels
Hardware support for C++
[Block diagram: six DRAM interfaces, host interface, GigaThread scheduler, and shared L2 cache surrounding the SM array]
Tesla Data Center & Workstation GPU Solutions
Integrated CPU-GPU servers & blades: Tesla M-series GPUs (M2090 | M2070 | M2050)
Workstations with 2 to 4 Tesla GPUs: Tesla C-series GPUs (C2070 | C2050)

                              M2090       M2070       M2050
Cores                         512         448         448
Memory                        6 GB        6 GB        3 GB
Memory bandwidth (ECC off)    177.6 GB/s  148.8 GB/s  148.8 GB/s
Peak single precision (GFLOPS) 1331       1030        1030
Peak double precision (GFLOPS) 665        515         515

                              C2070       C2050
Cores                         448         448
Memory                        6 GB        3 GB
Memory bandwidth (ECC off)    144 GB/s    144 GB/s
Peak single precision (GFLOPS) 1030       1030
Peak double precision (GFLOPS) 515        515
CUDA GPU Roadmap
[Chart: double-precision GFLOPS per watt (0–16) versus year, with Tesla (2007), Fermi (2009), Kepler (2011), and Maxwell (2013)]
NVIDIA Developer Eco-System
Languages: C, C++, Fortran, OpenCL, DirectCompute, Java, Python
GPU compilers / parallelizing compilers: PGI Accelerator, CAPS HMPP, mCUDA, OpenMP
Libraries: BLAS, FFT, LAPACK, NPP, video/imaging, GPULib
GPGPU consultants & training: ANEO, GPU Tech
Debuggers & profilers: cuda-gdb, NV Visual Profiler, Parallel Nsight for Visual Studio, Allinea, TotalView, VampirTrace
Numerical packages: MATLAB, Mathematica, NI LabVIEW, pyCUDA
Cluster tools: Bright Cluster Manager, Platform LSF/Symphony, Altair PBS Pro, Torque, GridEngine
OEM solutions + cloud platform providers: Amazon EC2, Peer 1
CUDA 4.0: Highlights
Easier Parallel Application Porting:
• Share GPUs across multiple threads
• Single-thread access to all GPUs
• No-copy pinning of system memory
• New CUDA C/C++ features
• Thrust templated primitives library
• NPP image/video processing library
• Layered textures
New & Improved Developer Tools:
• Automated performance analysis
• C++ debugging
• GPU binary disassembler
• cuda-gdb for MacOS
Faster Multi-GPU Programming:
• Unified Virtual Addressing
• NVIDIA GPUDirect™ v2.0
• Peer-to-peer access
• Peer-to-peer transfers
• GPU-accelerated MPI
C++ Templatized Algorithms & Data Structures (Thrust)
Powerful open-source C++ parallel algorithms & data structures
Similar to the C++ Standard Template Library (STL)
Automatically chooses the fastest code path at compile time
Divides work between GPUs and multi-core CPUs
Parallel sorting at 5× to 100× faster
Data structures: thrust::device_vector, thrust::host_vector, thrust::device_ptr, etc.
Algorithms: thrust::sort, thrust::reduce, thrust::exclusive_scan, etc.
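The STL-like containers and algorithms above compose naturally. A minimal sketch, assuming the Thrust headers that ship with the CUDA Toolkit (4.0 and later) and an nvcc build; the data values are illustrative:

```cuda
// Sort and reduce a small array on the GPU with Thrust.
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/copy.h>
#include <cstdio>

int main() {
    // Fill a host-side container, mirroring std::vector usage.
    thrust::host_vector<int> h(4);
    h[0] = 3; h[1] = 1; h[2] = 4; h[3] = 1;

    // Assignment performs the host-to-device transfer.
    thrust::device_vector<int> d = h;

    // Parallel sort and reduction run on the device.
    thrust::sort(d.begin(), d.end());
    int sum = thrust::reduce(d.begin(), d.end(), 0);

    // Copy the sorted data back to the host and print it.
    thrust::copy(d.begin(), d.end(), h.begin());
    printf("sorted: %d %d %d %d, sum: %d\n", h[0], h[1], h[2], h[3], sum);
    return 0;
}
```

Because Thrust selects backends at compile time, the same source can also target a multi-core CPU backend instead of the GPU.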
Unified Virtual Addressing: Easier to Program with a Single Address Space
[Diagram: without UVA, system memory and each GPU's memory are separate address spaces (each spanning 0x0000–0xFFFF); with UVA, CPU and GPU memories share one address space across PCIe]
Unified Virtual Addressing
One address space for all CPU and GPU memory
Determine the physical memory location from the pointer value
Enables libraries to simplify their interfaces (e.g., cudaMemcpy)
Supported on Tesla 20-series and other Fermi GPUs; 64-bit applications on Linux and Windows TCC
Before UVA, separate options for each permutation: cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice
With UVA, one function handles all cases: cudaMemcpyDefault (the data location becomes an implementation detail)
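A minimal sketch of the cudaMemcpyDefault usage described above, assuming a 64-bit CUDA 4.0+ application running on a Fermi-class GPU; the buffer size is illustrative:

```cuda
// With UVA, the runtime infers each pointer's location, so one
// cudaMemcpy call covers all copy directions.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t n = 1024, bytes = n * sizeof(float);
    float *host_buf, *dev_buf;

    cudaMallocHost(&host_buf, bytes);  // pinned host memory
    cudaMalloc(&dev_buf, bytes);       // device memory

    for (size_t i = 0; i < n; ++i) host_buf[i] = (float)i;

    // Direction is inferred from the pointer values, not spelled out.
    cudaMemcpy(dev_buf, host_buf, bytes, cudaMemcpyDefault);  // host -> device
    cudaMemcpy(host_buf, dev_buf, bytes, cudaMemcpyDefault);  // device -> host

    printf("last element: %f\n", host_buf[n - 1]);
    cudaFree(dev_buf);
    cudaFreeHost(host_buf);
    return 0;
}
```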
NVIDIA GPUDirect™: Towards Eliminating the CPU Bottleneck
Version 1.0, for applications that communicate over a network:
• Direct access to GPU memory for 3rd-party devices
• Eliminates unnecessary system memory copies & CPU overhead
• Supported by Mellanox and QLogic
• Up to 30% improvement in communication performance
Version 2.0, for applications that communicate within a node:
• Peer-to-peer memory access, transfers & synchronization
• Less code, higher programmer productivity
Details @ http://www.nvidia.com/object/software-for-tesla-products.html
GPUDirect v2.0: Peer-to-Peer Communication
[Diagram: direct access, where GPU0 issues loads/stores into GPU1 memory, and direct transfers, where cudaMemcpy() moves data between GPU0 and GPU1 memories over PCIe]
GPUDirect v2.0: Peer-to-Peer Communication
Direct communication between GPUs: faster (no system-memory copy overhead) and more convenient multi-GPU programming
Direct transfers: copy from GPU 0 memory to GPU 1 memory; works transparently with UVA
Direct access: GPU 0 reads or writes GPU 1 memory (load/store)
Supported only on Tesla 20-series (Fermi); 64-bit applications on Linux and Windows TCC
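The direct-transfer path above can be sketched as follows, assuming a 64-bit CUDA 4.0+ build with two P2P-capable Fermi GPUs on the same PCIe root; the buffer size is illustrative:

```cuda
// Peer-to-peer copy between two GPUs without staging in system memory.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, 0, 1);
    if (!can_access) {
        printf("P2P not supported between GPU 0 and GPU 1\n");
        return 1;
    }

    const size_t bytes = 1 << 20;
    float *buf0, *buf1;

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);  // let GPU 0 access GPU 1 memory
    cudaMalloc(&buf0, bytes);

    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);

    // Direct device-to-device transfer over PCIe.
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);

    // With UVA, a plain cudaMemcpy with cudaMemcpyDefault also works here.
    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}
```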
Echelon: NVIDIA's Extreme-Scale Computing Project
Power is THE Problem
1. Data movement dominates power
2. Optimize the storage hierarchy
3. Tailor memory to the application
Applications with Hierarchical Reuse Want a Deep Storage Hierarchy
[Diagram: 16 processors, each with its own L1, sharing four L2s above an L3 and an L4]
Applications with Plateaus Want a Shallow Storage Hierarchy
[Diagram: 16 processors with L1s and four L2s connected directly by a network-on-chip (NoC), with no deeper cache levels]
Configurable Memory Can Do Both, at the Same Time
Flat hierarchy for large working sets
Deep hierarchy for reuse
"Shared" memory for explicit management
Cache memory for unpredictable sharing
[Diagram: 16 processor/L1 pairs and SRAM banks joined by a NoC, with the SRAM configurable as either style of hierarchy]
Echelon Architecture
Lane: DFMAs, 20 GFLOPS
SM: 8 lanes, 160 GFLOPS (8 processors behind a switch with an L1 cache)
Chip: 128 SMs, 20.48 TFLOPS + 8 latency processors; 1024 SRAM banks of 256 KB each; NoC, memory controllers, network interface
Node MCM: 20 TFLOPS + 256 GB; GPU chip (20 TF DP, 256 MB on-chip), DRAM stacks, NV memory; 1.4 TB/s DRAM bandwidth, 150 GB/s network bandwidth
Echelon System Sketch
Software stack: self-aware OS, self-aware runtime, locality-aware compiler & autotuner
Echelon system: 400 cabinets, 1 EF, 15 MW
Cabinet 0 (C0): 16 modules, 2.6 PF, 205 TB/s, 32 TB
Module 0 (M0) through M15: 8 nodes, 160 TF, 12.8 TB/s, 2 TB
Node 0 (N0) through N7: 20 TF, 1.6 TB/s, 256 GB
Processor chip (PC): SM0 through SM127 (each with lanes L0 through L7 and cores C0 through C7), NoC, memory controller, NIC, L2 banks L2_0 through L2_1023, DRAM cubes, NVRAM
High-radix router module (RM) with link controllers LC0 through LC7; Dragonfly interconnect (optical fiber)
GPU Computing is the Future
1. GPU computing is #1 today: on the Top500, and dominant on the Green500
2. GPU computing enables exascale at reasonable power
3. The GPU is the computer: a general-purpose computing engine, not just an accelerator
4. The real challenge is software
Supercomputing with NVIDIA GPUs
HPCN Workshop, May 2011
Axel Koehler, NVIDIA