Supercomputing at 1/10th the Cost - Artificial Intelligence Computing … · 2012-05-31 · CUDA...
2
Mainstream Applications Going Parallel
CUDA Accelerates Adobe Mercury Playback Engine
- Amazingly fluid, real-time video editing
- Quick preview of real-time edits and effects
- Realistic preview of final content
- Faster encoding
3
GPU Innovation Accelerating
[Chart: NVIDIA R&D Budget, $ millions, 2000 to 2008, scale 0 to 900]
4
5
Max Planck Institute
National Center for Supercomputing Applications
Tokyo Institute of Technology
CSIRO
6
Dawning Nebulae
Second Fastest Supercomputer in the World
1.27 Petaflops
4640 Tesla GPUs
7
8
[Map: Tesla GPU deployments worldwide; legend distinguishes existing from prospective deployments]
Fermi Lab; CSIRO (256 GPUs); NCSA (384 GPUs); Chinese Academy of Sciences (2000+ GPUs); Daresbury Lab; PNNL (256 GPUs); National Taiwan Univ; Max Planck Institute; Argonne Lab; Harvard; Oxford; Jefferson Labs; Georgia Tech; TACC; Delaware; OSC; Maryland; Johns Hopkins; WestGrid; Stanford; UNC; NERSC; IIT Delhi; CEA; Tokyo Tech (680 GPUs); Peking University; Univ of Science & Tech; Tsinghua University; NCHC; Curtin University; Swinburne University; Riken (220 GPUs); Osaka; Nagasaki; KISTI; Anna Univ; IIT Madras; Indian Institute of Science; NIT Calicut; Dept of Space; LRDE; Indian Inst of Tropical Meteorology; Nizhegorodsky University; Kazan Univ; St. Petersburg University; Institute of Physics; Aarhus; Norwegian Univ of S & T; Braunschweig; Copenhagen; Oak Ridge; Wisconsin; VaTech; Cambridge; Groningen; SNU; Yonsei; Utah; Berkeley
9
NVIDIA Tesla 20-Series Products
Data Center and Workstation
10
GPU Computing
CPU + GPU Co-Processing
CPU: 4 cores, 48 GigaFlops (DP)
GPU: 515 GigaFlops (DP)
11
NVIDIA Developer Eco-System
Languages & APIs: C, C++, Fortran, OpenCL, DirectCompute, Java, Python
Debuggers & Profilers: cuda-gdb, NV Visual Profiler, Parallel Nsight for Visual Studio, Allinea, TotalView
Numerical Packages: MATLAB, Mathematica, NI LabVIEW, pyCUDA
GPU Compilers / Parallelizing Compilers: PGI Accelerator, CAPS HMPP, mCUDA, OpenMP
Libraries: BLAS, FFT, LAPACK, NPP, Video/Imaging, GPULib
OEM Solution Providers; GPGPU Consultants & Training: ANEO, GPU Tech
12
146x: Medical Imaging (U of Utah)
36x: Molecular Dynamics (U of Illinois, Urbana)
18x: Video Transcoding (Elemental Tech)
50x: MATLAB Computing (AccelerEyes)
100x: Astrophysics (RIKEN)
149x: Financial Simulation (Oxford)
47x: Linear Algebra (Universidad Jaime)
20x: 3D Ultrasound (Techniscan)
130x: Quantum Chemistry (U of Illinois, Urbana)
30x: Gene Sequencing (U of Maryland)
Typical speedups: 50x - 150x
13
Tesla Data Center & Workstation GPU Solutions
Workstations: 2 to 4 Tesla GPUs, integrated CPU-GPU
Servers & Blades: OEM CPU server + Tesla S-series 1U
Tesla S-series 1U Systems: S2050, S1070
Tesla M-series GPUs: M2070, M2050, M1060
Tesla C-series GPUs: C2070, C2050, C1060
14
Soul of a Supercomputer
[Chart: Linpack Teraflops per rack, scale 0 to 25: Nehalem 4-core CPU, Westmere 6-core CPU, Tesla 10-series, Tesla 20-series]
15
Heterogeneous = 2x Performance / Watt
[Chart: Gigaflops/watt, x86 CPU vs. GPU]
16
8x Higher Linpack
[Charts: CPU server vs. GPU-CPU server]
Performance: 80.1 vs. 656.1 Gflops
Performance / $: 11 vs. 60 Gflops / $K
Performance / watt: 146 vs. 656 Gflops / kW
CPU 1U Server: 2x Intel Xeon X5550 (Nehalem) 2.66 GHz, 48 GB memory, $7K, 0.55 kW
GPU-CPU 1U Server: 2x Tesla C2050 + 2x Intel Xeon X5550, 48 GB memory, $11K, 1.0 kW
17
37 TeraFlop System (Top 150 System)
4.5x Lower Power & Cooling Costs: $117K vs. $524K power savings every year
7x Less Space Required: 2 racks of GPU+CPUs vs. 15 racks of CPUs
5x Lower Cost: $740K vs. $3.8M
As per November 2009 Top 500 List
18
What if Every Supercomputer Had Fermi?
[Chart: Linpack Teraflops of the Top 500 supercomputers (Nov 2009), with GPU-equivalent systems overlaid]
Top 150: 150 GPUs, 37 TeraFlops, $740K
Top 100: 225 GPUs, 55 TeraFlops, $1.1M
Top 50: 450 GPUs, 110 TeraFlops, $2.2M
19
Successful Customers
GPU vs CPU Improvements:
Performance / Watt: 18x - 27x | 12x - 17x
Performance / Space: 20x - 31x | 15x - 20x
Performance / Cost: 15x - 20x | 10x - 12x
Oil & Gas ISVs: 20+ oil & gas companies porting to CUDA
20
Successful Customers
Finance ISVs: 10+ banks porting to CUDA (several unannounced)
Derivative Pricing using SciFinance (Basket Equity-Linked Structured Note):
- Intel Xeon (2.6 GHz): 31.05 secs
- 1 Tesla C1060: 0.41 secs (77x)
- 2 Tesla C1060s: 0.25 secs (124x)
UnRisk
21
Defense / Federal Agencies Opportunities
Defense contractors, federal agencies, defense services: speedups 10x - 50x
GIS: Manifold, PCI Geomatics, DigitalGlobe
Signal Processing: GPU VSIPL
MATLAB: GPU plugin available
UAV video analysis: MotionDSP Ikena
Virtual Prototyping: RealityServer
Surveillance, Cryptography: software available
22
Tesla Bio WorkBench : Bio-Chemistry & Bio-Informatics
Applications: MUMmerGPU, LAMMPS, CUDA-BLASTP, TeraChem, CUDA-EC, CUDA-MEME, Hex (Docking)
Community: technical papers, discussion forums, benchmarks & configurations, downloads, documentation
Platforms: Tesla Personal Supercomputer, Tesla GPU Clusters
23
Increasing Number of CUDA Applications
Roadmap through 2010 Q2 - Q4; legend marks released, announced, and unannounced products.
- Tools & Libraries: CUDA C/C++, PGI Fortran, MathWorks MATLAB, Mathematica, Nsight Visual Studio IDE, CULA LAPACK library, Allinea debugger, TotalView debugger, NPP Performance Primitives, NAG RNGs, Thrust C++ template library, OpenCurrent CFD/PDE library, Jacket MATLAB plugin, PGI Accelerator enhancements, CAPS HMPP enhancements, Platform Cluster Management, Bright Cluster Management
- Oil & Gas: seismic analysis (ffA, HeadWave, Geostar), seismic interpretation, Seismic City, Acceleware, reservoir simulation (two products)
- Bio-Chemistry: AMBER, GROMACS, GROMOS, HOOMD, LAMMPS, NAMD, VMD, TeraChem, BigDFT, ABINIT, other popular MD codes, quantum chemistry codes
- Bio-Informatics: CUDA-BLASTP, GPU-HMMER, MUMmerGPU, MEME, CUDA-EC (short-read sequence analysis), Hex protein docking
- Video & Rendering: mental ray with iray, 3D CAD SW with iray, MainConcept video encoder, OptiX ray tracing engine, Elemental Video Live, Fraunhofer JPEG2K
- Finance: Scicomp SciFinance, risk analysis products, credit risk analysis ISV, NumeriX counterparty risk, trading platform ISV
- EDA: Agilent ADS Spice simulator, SPICE simulator, Verilog simulator, lithography products; electro-magnetics: Agilent, CST, Remcom, SPEAG
- CAE: MSC MARC, MSC Nastran, Autodesk Moldflow, Moldex3D, Acusim AcuSolve CFD, structural mechanics ISV, several other CAE ISVs
24
Bloomberg: Bond Pricing
28x Lower Cost: $144K vs. $4 Million
42x Lower Space: 48 GPUs vs. 2000 CPUs
38x Lower Power Cost: $31K / year vs. $1.2 Million / year
25
Oil & Gas: Seismic Processing
Equal Performance: 32 Tesla S1070s vs. 2000 CPU servers
20x Lower Cost: ~$400K vs. ~$8M
27x Lower Power: 45 kW vs. 1200 kW
31x Less Space
26
Finding a Better Shampoo
Equal Performance: Tesla PSC vs. 32 CPU servers
Power: 1 kWatt vs. 21 kWatts
19x Lower Power & Cooling Cost: $2K vs. $37K
13x Lower System Cost: $7K vs. $128K
No Data Center Required
27
Reducing Radiation in CT
28
Flow Cytometry : Finding Cancer Cells
Flow Cytometer
29
Post-Katrina Hurricane: What Google Showed
30
What was really there : Digital Globe
31
Explosive Growth in CUDA Developers
Teaching Parallel Programming using NVIDIA GPUs
200,000 CUDA Developer Downloads
32
1000+ Applications and Papers using NVIDIA GPUs
10 GPGPU Books using NVIDIA GPUs
NVIDIA’s Commitment to GPU Computing
33
Programming GPUs
34
Doing GPGPU Right: Combination of Hardware and Software
GPU Computing Applications
Languages: C, C++, Fortran, Java and Python, OpenCL, DirectCompute
(OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.)
Libraries and Middleware: cuFFT, cuBLAS, CULA LAPACK, NPP & cuDPP, Video; PhysX (physics); OptiX (ray tracing); mental ray / iray (rendering); RealityServer (3D web services)
CUDA Parallel Computing Architecture
NVIDIA GPU
35
CUDA C/C++ Continuous Innovation
Timeline: 2007 - 2010
CUDA Toolkit 1.0 (July 07): C compiler, C extensions, single precision, BLAS, FFT, SDK with 40 examples
CUDA Toolkit 1.1 (Nov 07): Win XP 64, atomics support, multi-GPU support
CUDA Toolkit 2.0 (Aug 08): double precision, compiler optimizations, Vista 32/64, Mac OS X, 3D textures, HW interpolation
CUDA Visual Profiler 2.2; cuda-gdb HW debugger
CUDA Toolkit 2.3 (July 09): DP FFT, 16-32 conversion intrinsics, performance enhancements
CUDA Toolkit 3.0 (Mar 10): C++ inheritance, Fermi arch support, tools updates, driver / runtime interop
Parallel Nsight Beta
36
Compiling C for CUDA Applications
Serial C application (key kernel highlighted):

void serial_function(...) {
    ...
}
void other_function(int ...) {
    ...
}
void saxpy_serial(float ...) {
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}
int main() {
    ...
    saxpy_serial(...);
    ...
}

Flow: modify the key kernels into parallel CUDA code. NVCC (Open64) compiles the CUDA kernels into CUDA object files; the CPU compiler compiles the rest of the C application into CPU object files; the linker combines both into a single CPU-GPU executable.
37
C for CUDA : C with a few keywords
Standard C Code:

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

Parallel C Code:

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
38
CUDA Programming Effort / Performance
Source : MIT CUDA Course
39
Parallel Nsight for Visual Studio
Visual Profiler for Linux
cuda-gdb for Linux
40
Targeting Multiple Platforms with CUDA
CUDA C / C++
NVCC (NVIDIA CUDA Toolkit) -> NVIDIA GPUs
MCUDA (CUDA to multi-core) -> multi-core CPUs
Ocelot (PTX to multi-core) -> multi-core CPUs
Swan (CUDA to OpenCL) -> other GPUs
MCUDA: http://impact.crhc.illinois.edu/mcuda.php
Ocelot: http://code.google.com/p/gpuocelot/
Swan: http://www.multiscalelab.org/swan
41
Performance Benchmarks
42
8x Higher Linpack
[Charts: CPU server vs. GPU-CPU server]
Performance: 80.1 vs. 656.1 Gflops (8x)
Performance / $: 11 vs. 60 Gflops / $K (5x)
Performance / watt: 146 vs. 656 Gflops / kW (4.5x)
CPU 1U Server: 2x Intel Xeon X5550 (Nehalem) 2.66 GHz, 48 GB memory, $7K, 0.55 kW
GPU-CPU 1U Server: 2x Tesla C2050 + 2x Intel Xeon X5550, 48 GB memory, $11K, 1.0 kW
43
Performance Summary
[Charts: GPU speed-up vs. CPU, preliminary data]
- MIDG: Discontinuous Galerkin solvers for PDEs
- AMBER Molecular Dynamics (mixed precision)
- Lock Exchange Problem, OpenCurrent
- Radix Sort, CUDA SDK
- OpenEye ROCS Virtual Drug Screening
44
Standard FFT Library: cuFFT 3.1
[Charts: single- and double-precision FFT Gflops vs. transform size]
cuFFT 3.1: NVIDIA Tesla C1060, Tesla C2050 (Fermi); MKL 10.2.4.32: Quad-Core Intel Xeon 5550, 2.67 GHz
45
Standard BLAS Library: cuBLAS 3.1
[Charts: SGEMM and DGEMM Gflops vs. matrix size, 256x256 to 8192x8192]
cuBLAS 3.1: NVIDIA Tesla C1060, Tesla C2050 (Fermi); MKL 10.2.4.32: Quad-Core Intel Xeon 5550, 2.67 GHz
46
Matrix Size for Best cuBLAS Performance
[Charts: SGEMM Gflops at multiples of 80 (80x80 up to 8080x8080, scale 0 to 600); DGEMM Gflops at multiples of 48 (48x48 up to 8112x8112, scale 0 to 250)]
cuBLAS 3.1: NVIDIA Tesla C1060, Tesla C2050 (Fermi); MKL 10.2.4.32: Quad-Core Intel Xeon 5550, 2.67 GHz
47
CULA 1.3 LAPACK Library from EM Photonics
[Charts: double-precision GFLOPS vs. matrix size (2048 to 10240) on Core i7-920, Tesla C1060, Tesla C2050]
- LU Decomposition (DGETRF): partial pivoting; operation count estimated as 0.66*N^3
- Singular Value Decomposition (DGESVD): left & right singular vectors; operation count estimated as 21*N^3
- QR Decomposition (DGEQRF): Householder method; operation count estimated as 1.33*N^3
Data courtesy: EM Photonics
48
Sparse Matrix-Vector Multiplication (SpMV)
[Charts: double- and single-precision SpMV Gflops on Tesla C2050 (ECC off), Tesla C2050 (ECC on), Tesla C1060, and Intel Xeon 5550]
SpMV: CUDA 3.0, Tesla C1060 and Tesla C2050; MKL 10.2: Intel Xeon 5550, 2.67 GHz. Preliminary data.
49
Tesla GPU Computing Products
50
Fermi: The Computational GPU
Performance: 7x the double precision of CPUs; IEEE 754-2008 SP & DP floating point
Flexibility
51
NVIDIA Tesla GPU Computing Products
Tesla M2070 / M2050 (server module, 1 T20 GPU): 1030 GFlops SP, 515 GFlops DP, 6 GB / 3 GB memory, 148.4 GB/s
Tesla M1060 (server module, 1 T10 GPU): 933 GFlops SP, 78 GFlops DP, 4 GB memory, 102 GB/s
Tesla S2050 (1U system, 4 T20 GPUs): 4120 GFlops SP, 2060 GFlops DP, 12 GB memory, 148.4 GB/s
Tesla S1070 (1U system, 4 T10 GPUs): 4140 GFlops SP, 346 GFlops DP, 16 GB memory (4 GB / GPU), 102 GB/s
Tesla C2070 / C2050 (workstation board, 1 T20 GPU): 1030 GFlops SP, 515 GFlops DP, 6 GB / 3 GB memory, 144 GB/s
Tesla C1060 (workstation board, 1 T10 GPU): 933 GFlops SP, 78 GFlops DP, 4 GB memory, 102 GB/s
52
"In testing our key applications, the Tesla GPUs delivered speed-ups that we had never seen before, sometimes even orders of magnitude." Satoshi Matsuoka, Professor, Tokyo Institute of Technology
"Future computing architectures will be hybrid systems with parallel-core GPUs working in tandem with multi-core CPUs." Jack Dongarra, Professor, University of Tennessee; Author of Linpack
"I believe history will record Fermi as a significant milestone." Dave Patterson, Director, Parallel Computing Research Laboratory, U.C. Berkeley; Co-Author of Computer Architecture: A Quantitative Approach
53
Mass market adoption
Applications
Scaling