Thinking Outside of the Tera-Scale Box


Transcript of Thinking Outside of the Tera-Scale Box

Page 1: Thinking Outside of the Tera-Scale Box

Thinking Outside of the Tera-Scale Box

Piotr Luszczek

Page 2: Thinking Outside of the Tera-Scale Box

Brief History of Tera-flop: 1997

• 1997: ASCI Red

Page 3: Thinking Outside of the Tera-Scale Box

Brief History of Tera-flop: 2007

• 1997: ASCI Red
• 2007: Intel Polaris

Page 4: Thinking Outside of the Tera-Scale Box

Brief History of Tera-flop: GPGPU

• 1997: ASCI Red
• 2007: Intel Polaris
• 2008: GPGPU

Page 5: Thinking Outside of the Tera-Scale Box

Tera-flop: a Dictionary Perspective

• Belongs to supercomputer expert
• Consumes 500 kW
• Costs over $1M
• Requires distributed memory programming
• Memory: 1 GiB or more

• Belongs to gaming geek
• Consumes under 500 W
• Costs under $200
• Requires only a CUDA kernel
• Memory: 1 GiB or more

Page 6: Thinking Outside of the Tera-Scale Box

Double-Dip Recession?

[Figure: Productivity vs. Time, with marks at 2005 (Multicore) and 2007 (Hybrid Multicore)]

• Multicore: Cilk, Intel TBB, Intel Ct, Apple GCD, MS Concrt, …
• Hybrid Multicore: HMPP, PGI Accelerator, PathScale?

Page 7: Thinking Outside of the Tera-Scale Box

Locality Space in HPC Challenge

[Figure: HPC Challenge benchmarks placed in locality space: HPL, Mat-Mat-Mul, Parallel-Transpose, STREAM, FFT, RandomAccess]

Page 8: Thinking Outside of the Tera-Scale Box

Locality Space in HPC Challenge: Bounds

[Figure: locality space with regions marked Latency-bound, Bandwidth-bound, and Compute-bound; labels placed in it: HPL, Mat-Mat-Mul, Parallel-Transpose, STREAM, FFT, RandomAccess, AORSA2D, PCIe]

Page 9: Thinking Outside of the Tera-Scale Box

CUDA vs. OpenCL: Similar Software Stacks

Page 10: Thinking Outside of the Tera-Scale Box

Matrix-Matrix Multiply: the Simple Form

for i = 1:M
  for j = 1:N
    for k = 1:T
      C(i, j) += A(i, k) * B(k, j)

CUBLAS, Volkov et al., MAGMA, Nakasato
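The simple form above can be written as a runnable sketch. This is a plain Python transcription of the triple loop using 0-based indices; the function name `matmul_simple` and the list-of-lists storage are illustrative assumptions and have nothing to do with how CUBLAS or MAGMA implement the operation.

```python
def matmul_simple(A, B, C):
    """Accumulate C += A * B with the naive i-j-k loop nest."""
    M = len(A)        # rows of A and C
    T = len(B)        # columns of A, rows of B
    N = len(B[0])     # columns of B and C
    for i in range(M):
        for j in range(N):
            for k in range(T):
                C[i][j] += A[i][k] * B[k][j]
    return C


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[0.0, 0.0], [0.0, 0.0]]
matmul_simple(A, B, C)
print(C)
```

The loop nest performs 2·M·N·T floating-point operations (one multiply and one add per innermost iteration), which is the count normally used when reporting Gflop/s for this kernel.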

Page 11: Thinking Outside of the Tera-Scale Box

Fermi C2050: from CUDA to OpenCL

[Figure: Performance [Gflop/s] (0 to 700) vs. problem size. Series: CUDA, texture; CUDA, NO texture; OpenCL, no change; OpenCL, good unrolling; OpenCL, bad unrolling]

Page 12: Thinking Outside of the Tera-Scale Box

Fermi C2050: Simple OpenCL Porting

[Figure: Gflop/s (0 to 600) vs. problem size (192 to 5952). Series: Native OpenCL Mat-mul; Ported OpenCL Mat-mul]

Page 13: Thinking Outside of the Tera-Scale Box

ATI 5870: Simple OpenCL Porting

[Figure: Gflop/s (0 to 1600) vs. problem size (192 to 5952). Series: Native OpenCL Mat-mul; Ported OpenCL Mat-mul (inlined indexing); Ported OpenCL]

Page 14: Thinking Outside of the Tera-Scale Box

Making Case for Autotuning: Use of Texture

[Figure: Gflop/s (0 to 600) vs. problem size (192 to 7872). Series: Fermi C2050 OpenCL texture; Fermi C2050 OpenCL NO texture]

Page 15: Thinking Outside of the Tera-Scale Box

Making Case for Autotuning: Blocking Factors

Provided by NVIDIA
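A minimal sketch of the autotuning idea, assuming an exhaustive search over candidate blocking factors: time a blocked matrix multiply for each candidate and keep the fastest. The names `matmul_blocked` and `autotune_block_size`, the candidate list, and the pure-Python kernel are hypothetical stand-ins; a real tuner would generate and time GPU kernels instead.

```python
import time


def matmul_blocked(A, B, n, nb):
    """Multiply two n-by-n matrices using nb-by-nb blocks."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):
        for jj in range(0, n, nb):
            for kk in range(0, n, nb):
                # Update one block of C; min() handles ragged edges.
                for i in range(ii, min(ii + nb, n)):
                    row_a, row_c = A[i], C[i]
                    for k in range(kk, min(kk + nb, n)):
                        a = row_a[k]
                        row_b = B[k]
                        for j in range(jj, min(jj + nb, n)):
                            row_c[j] += a * row_b[j]
    return C


def autotune_block_size(n, candidates, repeats=2):
    """Return (best_nb, timings) from an exhaustive timing search."""
    A = [[float(i + j) for j in range(n)] for i in range(n)]
    B = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for nb in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            matmul_blocked(A, B, n, nb)
            best = min(best, time.perf_counter() - start)
        timings[nb] = best  # keep the fastest run for this candidate
    best_nb = min(timings, key=timings.get)
    return best_nb, timings


best_nb, timings = autotune_block_size(48, [4, 8, 16, 24])
print("best blocking factor:", best_nb)
```

The winning blocking factor depends on the machine and the problem size, which is the reason for searching empirically rather than fixing the value by hand.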

Page 16: Thinking Outside of the Tera-Scale Box

Mat-Mul Portability Summary

                         Achieved Performance   Achieved Performance
                         Single precision       Double precision
ATI OpenCL               52%                    57%
ATI IL                   74%                    87%
NVIDIA CUDA              63%                    58%
NVIDIA OpenCL            57%                    49%
Multicore (vendor)       79%                    79%
Multicore (autotuned)    46%                    40%
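As a worked example of how such percentages could be derived, assuming they are the ratio of achieved rate to theoretical peak rate (the slide does not say so explicitly): a matrix multiply with dimensions M, N, T performs 2·M·N·T floating-point operations, so the achieved rate follows from the run time. The function names, the sample run time, and the peak value below are made-up illustrations, not measurements from the talk.

```python
def gemm_gflops(m, n, t, seconds):
    """Achieved Gflop/s of an m-by-t times t-by-n multiply."""
    flops = 2.0 * m * n * t   # one multiply and one add per inner iteration
    return flops / seconds / 1e9


def percent_of_peak(achieved_gflops, peak_gflops):
    """Achieved rate expressed as a percentage of the peak rate."""
    return 100.0 * achieved_gflops / peak_gflops


# Hypothetical numbers for illustration only.
rate = gemm_gflops(4096, 4096, 4096, seconds=0.25)
print(round(rate, 1), "Gflop/s")
print(round(percent_of_peak(rate, peak_gflops=1000.0), 1), "% of peak")
```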

Page 17: Thinking Outside of the Tera-Scale Box

Now, let’s think outside of the box

Page 18: Thinking Outside of the Tera-Scale Box

Recipe for Performance Before Multicore

Wider superscalar instr. issue

Increase clock frequency

Increase number of transistors

Load/Store

Instr. scheduling

Page 19: Thinking Outside of the Tera-Scale Box

Recipe for Performance After Multicore

More cores

Parallel software

Page 20: Thinking Outside of the Tera-Scale Box

Recipe for Future Performance

More cores

Better sequential software

DAG Runtime

Page 21: Thinking Outside of the Tera-Scale Box

From Assembly to DAG: PLASMA

• Typical loop in assembly
– load R0, #1000
– divf F1, F2, F3
– load R1, #1000
– divf F2, F4, F5
– dec R1
– jump …

Sequential instruction stream → Instr. scheduling

• LU factorization
– for i = 1:N
–   GETRF2(A[i, i])
–   for j = i+1:N
–     TRSM(A[i, i], A[i, j])
–     for k = i+1:N
–       GEMM(A[k,i],A[i,j],A[k,j])

Sequential task stream → DAG scheduling

[Figure: task DAG with GETRF2, TRSM, and GEMM nodes]
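The step from a sequential task stream to a DAG can be sketched as follows. The code replays the slide's LU loop nest over tiles, records which tiles each task reads and writes, and derives dependency edges from those accesses. The read/write modes assigned to each kernel, the function names, and the 0-based tile indices are assumptions made for illustration; this is not PLASMA's actual runtime.

```python
def lu_task_stream(n):
    """Yield (name, reads, writes) in the order of the sequential loop nest."""
    for i in range(n):
        yield ("GETRF2", [], [(i, i)])              # factor diagonal tile in place
        for j in range(i + 1, n):
            yield ("TRSM", [(i, i)], [(i, j)])
            for k in range(i + 1, n):
                yield ("GEMM", [(k, i), (i, j)], [(k, j)])


def build_dag(tasks):
    """Return (task_list, edges); an edge (p, c) means task c waits for task p."""
    task_list = list(tasks)
    last_writer = {}      # tile -> index of the last task that wrote it
    readers = {}          # tile -> tasks that read it since the last write
    edges = set()
    for t, (name, reads, writes) in enumerate(task_list):
        for tile in reads:
            if tile in last_writer:                 # read-after-write
                edges.add((last_writer[tile], t))
            readers.setdefault(tile, []).append(t)
        for tile in writes:
            if tile in last_writer:                 # write-after-write
                edges.add((last_writer[tile], t))
            for r in readers.get(tile, []):         # write-after-read
                if r != t:
                    edges.add((r, t))
            last_writer[tile] = t
            readers[tile] = []
    return task_list, edges


tasks, edges = build_dag(lu_task_stream(2))
for parent, child in sorted(edges):
    print(tasks[parent][0], "->", tasks[child][0])
```

Tasks with no path between them in the resulting graph can run concurrently. With 3-by-3 tiles, for example, the two TRSM updates of the first step depend only on the first GETRF2 and not on each other, so a DAG runtime is free to schedule them in parallel.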

Page 22: Thinking Outside of the Tera-Scale Box

Scalability of LU Factorization on Multicore

[Figure: three panels (6 cores = 1 socket; 24 cores = 4 sockets; 48 cores = 8 sockets), x-axis ticks from 1000 to 10000. Series: PLASMA, MKL 11.0, LAPACK]

AMD Istanbul 2.8 GHz

Page 23: Thinking Outside of the Tera-Scale Box

Combining Multicore and GPU for QR Fact.

[Figure: Tflop/s (0 to 0.6) for x-axis ticks from 3840 to 19200. Series: CPUs + GPU; CPUs; GPU]

AMD Istanbul 2.8 GHz, 24 cores and Fermi GTX 480

Page 24: Thinking Outside of the Tera-Scale Box

Performance on Distributed Memory

[Figure: three panels (QR, LU, Cholesky), Tflop/s (0 to 35) vs. number of cores (108, 192, 432, 768, 1200, 3072). Series: Cray LibSci, DPLASMA, Peak]

ORNL Kraken: Cray XT5, AMD Istanbul 2.6 GHz, dual-socket 6-core; interconnect: Seastar, 3D torus

Page 25: Thinking Outside of the Tera-Scale Box

People and Projects

• Jack Dongarra
• Mathieu Faverge
• Jakub Kurzak
• Hatem Ltaief
• Stan Tomov
• Asim YarKhan
• Peng Du
• Rick Weber

• MAGMA
• PLASMA
• DPLASMA and DAGuE
– George Bosilca
– Aurelien Bouteiller
– Anthony Danalis
– Thomas Herault
– Pierre Lemarinier