Thinking Outside of the Tera-Scale Boxluszczek/hpec2010/luszczek_outside_tera_box.pdf · Thinking...

25
Thinking Outside of the Tera-Scale Box Piotr Luszczek

Transcript of Thinking Outside of the Tera-Scale Boxluszczek/hpec2010/luszczek_outside_tera_box.pdf · Thinking...

Page 1: Thinking Outside of the Tera-Scale Boxluszczek/hpec2010/luszczek_outside_tera_box.pdf · Thinking Outside of the Tera-Scale Box Piotr Luszczek . Brief History of Tera-flop: 1997 1997

Thinking Outside of the Tera-Scale Box

Piotr Luszczek

Page 2: Thinking Outside of the Tera-Scale Boxluszczek/hpec2010/luszczek_outside_tera_box.pdf · Thinking Outside of the Tera-Scale Box Piotr Luszczek . Brief History of Tera-flop: 1997 1997

Brief History of Tera-flop: 1997

1997

ASCI Red

Page 3: Thinking Outside of the Tera-Scale Boxluszczek/hpec2010/luszczek_outside_tera_box.pdf · Thinking Outside of the Tera-Scale Box Piotr Luszczek . Brief History of Tera-flop: 1997 1997

Brief History of Tera-flop: 2007

1997

2007

ASCI Red

Intel Polaris

Page 4: Thinking Outside of the Tera-Scale Boxluszczek/hpec2010/luszczek_outside_tera_box.pdf · Thinking Outside of the Tera-Scale Box Piotr Luszczek . Brief History of Tera-flop: 1997 1997

Brief History of Tera-flop: GPGPU

2008

1997

2007

ASCI Red

Intel Polaris

GPGPU

Page 5: Thinking Outside of the Tera-Scale Boxluszczek/hpec2010/luszczek_outside_tera_box.pdf · Thinking Outside of the Tera-Scale Box Piotr Luszczek . Brief History of Tera-flop: 1997 1997

Tera-flop: a Dictionary Perspective •  Belongs to

supercomputer expert •  Consumes 500 kW •  Costs over $1M •  Requires distributed

memory programming

•  Memory: 1 GiB or more

•  Belongs to gaming geek

•  Consumes under 500 W •  Costs under $200 •  Requires only a CUDA

kernel

•  Memory: 1 GiB or more

Page 6: Thinking Outside of the Tera-Scale Boxluszczek/hpec2010/luszczek_outside_tera_box.pdf · Thinking Outside of the Tera-Scale Box Piotr Luszczek . Brief History of Tera-flop: 1997 1997

Double-Dip Recession?

2007 2005

Time

Productivity

Hybrid Multicore Multicore

Cilk, Intel TBB, Intel Ct, Apple

GCD, MS Concrt, …

HMPP, PGI Accelerator, PathScale?

Page 7: Thinking Outside of the Tera-Scale Boxluszczek/hpec2010/luszczek_outside_tera_box.pdf · Thinking Outside of the Tera-Scale Box Piotr Luszczek . Brief History of Tera-flop: 1997 1997

Locality Space in HPC Challenge HPL Mat-Mat-Mul

Parallel-Transpose STREAM

FFT RandomAccess

Page 8: Thinking Outside of the Tera-Scale Boxluszczek/hpec2010/luszczek_outside_tera_box.pdf · Thinking Outside of the Tera-Scale Box Piotr Luszczek . Brief History of Tera-flop: 1997 1997

Locality Space in HPC Challenge: Bounds HPL Mat-Mat-Mul

Parallel-Transpose STREAM

FFT RandomAccess Latency-bound

Bandwidth-­‐bound  

Compute-bound

AORSA2D  

PCIe  

Page 9: Thinking Outside of the Tera-Scale Boxluszczek/hpec2010/luszczek_outside_tera_box.pdf · Thinking Outside of the Tera-Scale Box Piotr Luszczek . Brief History of Tera-flop: 1997 1997

CUDA vs. OpenCL: Similar Software Stacks

Page 10: Thinking Outside of the Tera-Scale Boxluszczek/hpec2010/luszczek_outside_tera_box.pdf · Thinking Outside of the Tera-Scale Box Piotr Luszczek . Brief History of Tera-flop: 1997 1997

Matrix-Matrix Multiply: the Simple Form for i = 1:M for j = 1:N for k = 1:T C(i, j) += A(i, k) * B(k, j)

CUBLAS   Volkov  et  al.   MAGMA   Nakasato  

Page 11: Thinking Outside of the Tera-Scale Boxluszczek/hpec2010/luszczek_outside_tera_box.pdf · Thinking Outside of the Tera-Scale Box Piotr Luszczek . Brief History of Tera-flop: 1997 1997

Fermi C2050: from CUDA to OpenCL

0

100

200

300

400

500

600

700

Perf

orm

ance

[Gflo

p/s]

Problem size

CUDA, texture CUDA, NO texture OpenCL, no change OpenCL, good unrolling OpenCL, bad unrolling

Page 12: Thinking Outside of the Tera-Scale Boxluszczek/hpec2010/luszczek_outside_tera_box.pdf · Thinking Outside of the Tera-Scale Box Piotr Luszczek . Brief History of Tera-flop: 1997 1997

Fermi C2050: Simple OpenCL Porting

0

100

200

300

400

500

600

192

384

576

768

960

1152

13

44

1536

17

28

1920

21

12

2304

24

96

2688

28

80

3072

32

64

3456

36

48

3840

40

32

4224

44

16

4608

48

00

4992

51

84

5376

55

68

5760

59

52

Gflo

p/s

Problem size

Native OpenCL Mat-mul

Ported OpenCL Mat-mul

Page 13: Thinking Outside of the Tera-Scale Boxluszczek/hpec2010/luszczek_outside_tera_box.pdf · Thinking Outside of the Tera-Scale Box Piotr Luszczek . Brief History of Tera-flop: 1997 1997

ATI 5870: Simple OpenCL Porting

0

200

400

600

800

1000

1200

1400

1600

192

384

576

768

960

1152

13

44

1536

17

28

1920

21

12

2304

24

96

2688

28

80

3072

32

64

3456

36

48

3840

40

32

4224

44

16

4608

48

00

4992

51

84

5376

55

68

5760

59

52

Gflo

p/s

Problem size

Native OpenCL Mat-mul

Ported OpenCL Mat-mul (inlined indexing)

Ported OpenCL

Page 14: Thinking Outside of the Tera-Scale Boxluszczek/hpec2010/luszczek_outside_tera_box.pdf · Thinking Outside of the Tera-Scale Box Piotr Luszczek . Brief History of Tera-flop: 1997 1997

Making Case for Autotuning: Use of Texture

0

100

200

300

400

500

600 G

flop/

s

Problem size

Fermi C2050 OpenCL texture

Fermi C2050 OpenCL NO texture

Page 15: Thinking Outside of the Tera-Scale Boxluszczek/hpec2010/luszczek_outside_tera_box.pdf · Thinking Outside of the Tera-Scale Box Piotr Luszczek . Brief History of Tera-flop: 1997 1997

Making Case for Autotuning: Blocking Factors

Provided by NVIDIA

Page 16: Thinking Outside of the Tera-Scale Boxluszczek/hpec2010/luszczek_outside_tera_box.pdf · Thinking Outside of the Tera-Scale Box Piotr Luszczek . Brief History of Tera-flop: 1997 1997

Mat-Mul Portability Summary

Achieved Performance Single precision

Achieved Performance Double precision

ATI OpenCL 52% 57% ATI IL 74% 87% NVIDIA CUDA 63% 58% NVIDIA OpenCL 57% 49% Multicore (vendor) 79% 79% Multicore (autotuned) 46% 40%

Page 17: Thinking Outside of the Tera-Scale Boxluszczek/hpec2010/luszczek_outside_tera_box.pdf · Thinking Outside of the Tera-Scale Box Piotr Luszczek . Brief History of Tera-flop: 1997 1997

Now, let’s think outside of the box

Page 18: Thinking Outside of the Tera-Scale Boxluszczek/hpec2010/luszczek_outside_tera_box.pdf · Thinking Outside of the Tera-Scale Box Piotr Luszczek . Brief History of Tera-flop: 1997 1997

Recipe for Performance Before Multicore

Wider superscalar instr. issue

Increase clock frequency

Increase number of transistor

Load/Store

Instr. scheduling

Page 19: Thinking Outside of the Tera-Scale Boxluszczek/hpec2010/luszczek_outside_tera_box.pdf · Thinking Outside of the Tera-Scale Box Piotr Luszczek . Brief History of Tera-flop: 1997 1997

Recipe for Performance After Multicore

More cores

Parallel software

Page 20: Thinking Outside of the Tera-Scale Boxluszczek/hpec2010/luszczek_outside_tera_box.pdf · Thinking Outside of the Tera-Scale Box Piotr Luszczek . Brief History of Tera-flop: 1997 1997

Recipe for Future Performance

More cores

Better sequential software

DAG Runtime

Page 21: Thinking Outside of the Tera-Scale Boxluszczek/hpec2010/luszczek_outside_tera_box.pdf · Thinking Outside of the Tera-Scale Box Piotr Luszczek . Brief History of Tera-flop: 1997 1997

From Assembly to DAG: PLASMA •  Typical loop in assembly

–  load R0, #1000 –  divf F1, F2, F3 –  load R1, #1000 –  divf F2, F4, F5 –  dec R1 –  jump …

•  LU factorization –  for  i  =  1:N  –       GETR2(A[i,  i])  –       for  j  =  i+1:N  –           TRSM(A[i,  i],  A[i,  j])  

–           for  k  =  i+1:N  –             GEMM(A[k,i],A[i,j],A[k,j])  Instr. scheduling

GETRF2

TRSM

GETRF2

GEMM GEMM

TRSM GEMM GEMM DAG scheduling

Sequential instruction

stream

Sequential task

stream

Page 22: Thinking Outside of the Tera-Scale Boxluszczek/hpec2010/luszczek_outside_tera_box.pdf · Thinking Outside of the Tera-Scale Box Piotr Luszczek . Brief History of Tera-flop: 1997 1997

Scalability of LU Factorization on Multicore

0

5

10

15

20

25

30

35

40

45

1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 0

20

40

60

80

100

120

1000 2000 3000 4000 5000 6000 7000 8000 9000 10000

0

20

40

60

80

100

120

140

160

180

1000 2000 3000 4000 5000 6000 7000 8000 9000 10000

PLASMA

MKL 11.0

LAPACK

6 cores = 1 socket

24 cores = 4 sockets

48 cores = 8 sockets

AMD Istanbul 2.8 GHz

Page 23: Thinking Outside of the Tera-Scale Boxluszczek/hpec2010/luszczek_outside_tera_box.pdf · Thinking Outside of the Tera-Scale Box Piotr Luszczek . Brief History of Tera-flop: 1997 1997

Combining Multicore and GPU for QR Fact.

0

0.1

0.2

0.3

0.4

0.5

0.6

3840 5120 6400 7680 8960 10240 11520 12800 14080 15360 16640 17920 19200

Tflo

p/s

CPUs + GPU

CPUs

GPU

AMD Instanbul 2.8 GHz, 24 cores and Fermi GTX 480

Page 24: Thinking Outside of the Tera-Scale Boxluszczek/hpec2010/luszczek_outside_tera_box.pdf · Thinking Outside of the Tera-Scale Box Piotr Luszczek . Brief History of Tera-flop: 1997 1997

Performance on Distributed Memory

0  

5  

10  

15  

20  

25  

30  

35  

108   192   432   768   1200   3072  

Tflop

/s  

Number  of  cores  

0  

5  

10  

15  

20  

25  

30  

35  

108   192   432   768   1200   3072  

Tflop

/s  

Number  of  cores  

0  

5  

10  

15  

20  

25  

30  

35  

108   192   432   768   1200   3072  

Tflop

/s  

Number  of  cores  

QR LU

Cholesky

ORNL Kraken Cray XT5 AMD Istanbul 2.6 GHz Dual-socet 6-core Interconnect: Seastar, 3D torus

Cray LibSci

DPLASMA

Peak

Page 25: Thinking Outside of the Tera-Scale Boxluszczek/hpec2010/luszczek_outside_tera_box.pdf · Thinking Outside of the Tera-Scale Box Piotr Luszczek . Brief History of Tera-flop: 1997 1997

People and Projects •  Jack Dongarra •  Mathieu Faverge •  Jakub Kurzak •  Hatem Ltaief •  Stan Tomov •  Asim YarKhan •  Peng Du •  Rick Weber

•  MAGMA •  PLASMA •  DPLASMA and DAGuE

–  George Bosilca –  Aurelien Bouteiller –  Anthony Danalis –  Thomas Herault –  Pierre Lemarinier