Thinking Outside of the Tera-Scale Box

25
Thinking Outside of the Tera- Scale Box Piotr Luszczek

description

Thinking Outside of the Tera-Scale Box. Piotr Luszczek. Brief History of Tera-flop: 1997. 1997. ASCI Red. Brief History of Tera-flop: 2007. Intel Polaris. 2007. 1997. ASCI Red. Brief History of Tera-flop: GPGPU. GPGPU. Intel Polaris. 2007. 2008. 1997. ASCI Red. - PowerPoint PPT Presentation

Transcript of Thinking Outside of the Tera-Scale Box

Page 1: Thinking Outside of the Tera-Scale Box

Thinking Outside of the Tera-Scale Box

Piotr Luszczek

Page 2: Thinking Outside of the Tera-Scale Box

Brief History of Tera-flop: 1997

1997

ASCI Red

Page 3: Thinking Outside of the Tera-Scale Box

Brief History of Tera-flop: 2007

1997

2007

ASCI Red

Intel Polaris

Page 4: Thinking Outside of the Tera-Scale Box

Brief History of Tera-flop: GPGPU

2008

1997

2007

ASCI Red

Intel Polaris

GPGPU

Page 5: Thinking Outside of the Tera-Scale Box

Tera-flop: a Dictionary Perspective• Belongs to

supercomputer expert• Consumes 500 kW• Costs over $1M• Requires distributed

memory programming

• Memory: 1 GiB or more

• Belongs togaming geek

• Consumes under 500 W• Costs under $200• Requires only a CUDA

kernel

• Memory: 1 GiB or more

Page 6: Thinking Outside of the Tera-Scale Box

Double-Dip Recession?

20072005

Time

Productivity

Hybrid MulticoreMulticore

Cilk, Intel TBB, Intel Ct, Apple

GCD, MS Concrt, …

HMPP, PGI Accelerator, PathScale?

Page 7: Thinking Outside of the Tera-Scale Box

Locality Space in HPC ChallengeHPLMat-Mat-Mul

Parallel-TransposeSTREAM

FFTRandomAccess

Page 8: Thinking Outside of the Tera-Scale Box

Locality Space in HPC Challenge: BoundsHPLMat-Mat-Mul

Parallel-TransposeSTREAM

FFTRandomAccessLatency-bound

Bandwidth-bound

Compute-bound

AORSA2D

PCIe

Page 9: Thinking Outside of the Tera-Scale Box

CUDA vs. OpenCL: Similar Software Stacks

Page 10: Thinking Outside of the Tera-Scale Box

Matrix-Matrix Multiply: the Simple Formfor i = 1:Mfor j = 1:N for k = 1:T C(i, j) += A(i, k) * B(k, j)

CUBLAS Volkov et al. MAGMA Nakasato

Page 11: Thinking Outside of the Tera-Scale Box

Fermi C2050: from CUDA to OpenCL

Page 12: Thinking Outside of the Tera-Scale Box

Fermi C2050: Simple OpenCL Porting

Page 13: Thinking Outside of the Tera-Scale Box

ATI 5870: Simple OpenCL Porting

Page 14: Thinking Outside of the Tera-Scale Box

Making Case for Autotuning: Use of Texture

Page 15: Thinking Outside of the Tera-Scale Box

Making Case for Autotuning: Blocking Factors

Provided by NVIDIA

Page 16: Thinking Outside of the Tera-Scale Box

Mat-Mul Portability Summary

Achieved PerformanceSingle precision

Achieved PerformanceDouble precision

ATI OpenCL 52% 57%

ATI IL 74% 87%

NVIDIA CUDA 63% 58%

NVIDIA OpenCL 57% 49%

Multicore (vendor) 79% 79%

Multicore (autotuned) 46% 40%

Page 17: Thinking Outside of the Tera-Scale Box

Now, let’s think outside of the box

Page 18: Thinking Outside of the Tera-Scale Box

Recipe for Performance Before Multicore

Wider superscalar instr. issue

Increase clock frequency

Increase number of transistor

Load/Store

Instr. scheduling

Page 19: Thinking Outside of the Tera-Scale Box

Recipe for Performance After Multicore

More cores

Parallel software

Page 20: Thinking Outside of the Tera-Scale Box

Recipe for Future Performance

More cores

Better sequential software

DAG Runtime

Page 21: Thinking Outside of the Tera-Scale Box

From Assembly to DAG: PLASMA• Typical loop in assembly

– load R0, #1000– divf F1, F2, F3– load R1, #1000– divf F2, F4, F5– dec R1– jump …

• LU factorization– for i = 1:N– GETR2(A[i, i])– for j = i+1:N– TRSM(A[i, i], A[i, j])– for k = i+1:N– GEMM(A[k,i],A[i,j],A[k,j])Instr. scheduling

GETRF2

TRSM

GETRF2

GEMM GEMM

TRSM GEMM GEMMDAG scheduling

Sequential instruction

stream

Sequential task

stream

Page 22: Thinking Outside of the Tera-Scale Box

Scalability of LU Factorization on Multicore

PLASMA

MKL 11.0

LAPACK

6 cores = 1 socket

24 cores = 4 sockets

48 cores = 8 sockets

AMD Istanbul 2.8 GHz

Page 23: Thinking Outside of the Tera-Scale Box

Combining Multicore and GPU for QR Fact.

AMD Instanbul 2.8 GHz, 24 cores and Fermi GTX 480

Page 24: Thinking Outside of the Tera-Scale Box

Performance on Distributed Memory

QR LU

Cholesky

ORNL KrakenCray XT5AMD Istanbul 2.6 GHzDual-socet 6-coreInterconnect: Seastar, 3D torus

Cray LibSci

DPLASMA

Peak

Page 25: Thinking Outside of the Tera-Scale Box

People and Projects• Jack Dongarra• Mathieu Faverge• Jakub Kurzak• Hatem Ltaief• Stan Tomov• Asim YarKhan• Peng Du• Rick Weber

• MAGMA• PLASMA• DPLASMA and DAGuE

– George Bosilca– Aurelien Bouteiller– Anthony Danalis– Thomas Herault– Pierre Lemarinier