Thinking Outside of the Tera-Scale Box
Piotr Luszczek (HPEC 2010)
Brief History of Tera-flop
• 1997: ASCI Red
• 2007: Intel Polaris
• 2008: GPGPU
Tera-flop: a Dictionary Perspective
The 1997 supercomputer:
• Belongs to a supercomputer expert
• Consumes 500 kW
• Costs over $1M
• Requires distributed memory programming
• Memory: 1 GiB or more
The 2008 GPGPU:
• Belongs to a gaming geek
• Consumes under 500 W
• Costs under $200
• Requires only a CUDA kernel
• Memory: 1 GiB or more
Double-Dip Recession?
[Figure: programmer productivity over time, dipping at the multicore transition (2005) and again at the hybrid multicore transition (2007)]
• Multicore: Cilk, Intel TBB, Intel Ct, Apple GCD, MS ConcRT, …
• Hybrid multicore: HMPP, PGI Accelerator, PathScale?
Locality Space in HPC Challenge: Bounds
• Compute-bound: HPL, Mat-Mat-Mul
• Bandwidth-bound: Parallel-Transpose, STREAM
• Latency-bound: FFT, RandomAccess
AORSA2D
[Figure: PCIe]
CUDA vs. OpenCL: Similar Software Stacks
Matrix-Matrix Multiply: the Simple Form
for i = 1:M
  for j = 1:N
    for k = 1:T
      C(i, j) += A(i, k) * B(k, j)
Notable implementations: CUBLAS, Volkov et al., MAGMA, Nakasato
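The triple loop above translates directly to C; a minimal row-major sketch (the function name and storage layout are my own, not from any of the libraries listed):

```c
/* Naive matrix-matrix multiply: C += A * B,
 * A is M x T, B is T x N, C is M x N, all row-major. */
void matmul(int M, int N, int T,
            const double *A, const double *B, double *C)
{
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < T; k++)
                C[i * N + j] += A[i * T + k] * B[k * N + j];
}
```

Every optimized variant in the slides that follow — texture fetches, unrolling, blocking — computes exactly this, only faster.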
Fermi C2050: from CUDA to OpenCL
[Figure: performance (Gflop/s, up to ~700) vs. problem size for five variants: CUDA with texture, CUDA without texture, OpenCL with no change, OpenCL with good unrolling, OpenCL with bad unrolling]
Fermi C2050: Simple OpenCL Porting
[Figure: performance (Gflop/s, up to ~600) vs. problem size (192–5952): native OpenCL mat-mul vs. ported OpenCL mat-mul]
ATI 5870: Simple OpenCL Porting
[Figure: performance (Gflop/s, up to ~1600) vs. problem size (192–5952): native OpenCL mat-mul, ported OpenCL mat-mul (inlined indexing), ported OpenCL]
Making Case for Autotuning: Use of Texture
[Figure: performance (Gflop/s, up to ~600) vs. problem size on Fermi C2050: OpenCL with texture vs. OpenCL without texture]
Making Case for Autotuning: Blocking Factors
[Figure: performance across the blocking-factor search space, with the configuration provided by NVIDIA marked]
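As a CPU-side illustration of what a blocking factor is — the single knob an autotuner would sweep — here is the same triple-loop multiply, tiled by a factor NB (the code and the NB value are my own sketch, not NVIDIA's kernel):

```c
/* Blocked matrix-matrix multiply: C += A * B for n x n row-major
 * matrices. NB is the blocking factor an autotuner would vary;
 * NB = 2 here is purely illustrative. */
#define NB 2

void matmul_blocked(int n, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += NB)
        for (int jj = 0; jj < n; jj += NB)
            for (int kk = 0; kk < n; kk += NB)
                /* multiply one NB x NB tile (with edge clamping) */
                for (int i = ii; i < ii + NB && i < n; i++)
                    for (int j = jj; j < jj + NB && j < n; j++)
                        for (int k = kk; k < kk + NB && k < n; k++)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```

Every NB produces the same answer; only the cache (or, on a GPU, shared-memory and register) behavior changes, which is why the best factor must be searched for rather than guessed.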
Mat-Mul Portability Summary

Achieved performance:      Single precision   Double precision
ATI OpenCL                 52%                57%
ATI IL                     74%                87%
NVIDIA CUDA                63%                58%
NVIDIA OpenCL              57%                49%
Multicore (vendor)         79%                79%
Multicore (autotuned)      46%                40%
Now, let’s think outside of the box
Recipe for Performance Before Multicore
• Wider superscalar instruction issue
• Increase clock frequency
• Increase number of transistors
• Load/Store
• Instruction scheduling
Recipe for Performance After Multicore
• More cores
• Parallel software
Recipe for Future Performance
• More cores
• Better sequential software
• DAG runtime
From Assembly to DAG: PLASMA
• Typical loop in assembly (a sequential instruction stream, ordered by the hardware's instruction scheduler):
  load R0, #1000
  divf F1, F2, F3
  load R1, #1000
  divf F2, F4, F5
  dec R1
  jump …
• LU factorization (a sequential task stream, ordered by the DAG scheduler):
  for i = 1:N
    GETRF2(A[i, i])
    for j = i+1:N
      TRSM(A[i, i], A[i, j])
    for k = i+1:N
      for j = i+1:N
        GEMM(A[k, i], A[i, j], A[k, j])
[Figure: the task stream GETRF2, TRSM, GEMM, … unfolded into a dependence DAG for scheduling]
Scalability of LU Factorization on Multicore
[Figure: performance (Gflop/s) vs. problem size (1000–10000) for PLASMA, MKL 11.0, and LAPACK on 6 cores (1 socket), 24 cores (4 sockets), and 48 cores (8 sockets); AMD Istanbul 2.8 GHz]
Combining Multicore and GPU for QR Factorization
[Figure: performance (Tflop/s, up to ~0.6) vs. problem size (3840–19200) for CPUs + GPU, CPUs only, and GPU only; AMD Istanbul 2.8 GHz, 24 cores, and Fermi GTX 480]
Performance on Distributed Memory
[Figure: performance (Tflop/s, 0–35) vs. number of cores (108–3072) for QR, LU, and Cholesky: DPLASMA vs. Cray LibSci, with peak shown; ORNL Kraken Cray XT5, AMD Istanbul 2.6 GHz, dual-socket 6-core nodes, SeaStar 3D-torus interconnect]
People and Projects
• Jack Dongarra
• Mathieu Faverge
• Jakub Kurzak
• Hatem Ltaief
• Stan Tomov
• Asim YarKhan
• Peng Du
• Rick Weber
Projects:
• MAGMA
• PLASMA
• DPLASMA and DAGuE
  – George Bosilca
  – Aurelien Bouteiller
  – Anthony Danalis
  – Thomas Herault
  – Pierre Lemarinier