The HPC Challenge (HPCC) Benchmark Suite

Page 1: The HPC Challenge (HPCC) Benchmark Suite

Presented by

The HPC Challenge (HPCC) Benchmark Suite

Piotr Luszczek

The MathWorks, Inc.

http://icl.cs.utk.edu/hpcc/

Page 2: The HPC Challenge (HPCC) Benchmark Suite

2 Luszczek_HPCC_SC07

HPCC: Components

1. HPL (High Performance LINPACK): Ax = b

2. STREAM

-------------------------------------------------------
name     kernel                  bytes/iter  FLOPS/iter
-------------------------------------------------------
COPY:    a(i) = b(i)             16          0
SCALE:   a(i) = q*b(i)           16          1
SUM:     a(i) = b(i) + c(i)      24          1
TRIAD:   a(i) = b(i) + q*c(i)    24          2
-------------------------------------------------------

3. PTRANS: A ← A^T + B

4. RandomAccess: T[k] ← T[k] (+) a_i, where T is a table of 64-bit entries and (+) is XOR

5. FFT: $z_k = \sum_{j=0}^{n-1} x_j \exp(-2\pi\sqrt{-1}\, jk/n)$

6. Matrix-matrix multiply: C ← s*C + t*A*B

7. b_eff (effective bandwidth/latency): ping-pong
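The STREAM table above can be read as a recipe: each kernel touches a fixed number of 8-byte values and performs a fixed number of floating-point operations per loop iteration. The following is a minimal sketch of that accounting, not the STREAM reference code. The function name `stream_kernels`, the array length, and the value of `q` are illustrative assumptions, and plain Python lists will not report meaningful hardware bandwidth.

```python
import time

def stream_kernels(n=100_000, q=3.0):
    """Run the four STREAM kernels on Python lists (illustrative only)."""
    b = [1.0] * n
    c = [2.0] * n
    # (name, kernel, bytes per iteration, FLOPs per iteration), as in the table.
    # Byte counts assume 8-byte values: one write plus one or two reads.
    kernels = [
        ("COPY",  lambda: [b[i] for i in range(n)],            16, 0),
        ("SCALE", lambda: [q * b[i] for i in range(n)],        16, 1),
        ("SUM",   lambda: [b[i] + c[i] for i in range(n)],     24, 1),
        ("TRIAD", lambda: [b[i] + q * c[i] for i in range(n)], 24, 2),
    ]
    results = {}
    for name, kernel, bytes_per_iter, flops_per_iter in kernels:
        start = time.perf_counter()
        a = kernel()
        elapsed = max(time.perf_counter() - start, 1e-12)  # guard against a zero reading
        results[name] = {
            "first": a[0],                         # sample value of the result
            "bytes": bytes_per_iter * n,           # nominal data traffic
            "flops": flops_per_iter * n,           # nominal floating-point work
            "GB/s": bytes_per_iter * n / elapsed / 1e9,
        }
    return results

if __name__ == "__main__":
    for name, r in stream_kernels().items():
        print(f"{name:6s} {r['bytes']:>10d} bytes {r['flops']:>8d} flops {r['GB/s']:.3f} GB/s")
```

The reported rate is nominal bytes divided by elapsed time, which is how the bytes/iter column is meant to be used; the interpreter overhead here dominates, so only the bookkeeping carries over to a real run.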

Page 3: The HPC Challenge (HPCC) Benchmark Suite

HPCC: Motivation and measurement

[Figure, two panels, each plotting spatial locality against temporal locality from low to high.

Concept: DGEMM, HPL, PTRANS, STREAM, FFT, and RandomAccess, together with mission partner applications.

Measurement (generated by PMaC @ SDSC): HPC Challenge benchmarks and select applications on axes from 0.00 to 1.00: HPL, Test3D, CG, Overflow, Gamess, RandomAccess, AVUS, OOCore, RFCTH2, STREAM, HYCOM.]

Spatial and temporal data locality here is for one node/processor, i.e., locally or "in the small."

Page 4: The HPC Challenge (HPCC) Benchmark Suite

HPCC: Scope and naming conventions

[Figure: three scopes, each drawn as groups of processors (P) sharing memory (M) and connected by a network.]

• G: Global
• EP: Embarrassingly Parallel
• S: Single

Benchmarks: HPL, STREAM, FFT, ..., RandomAccess (1 m), HPL (25%)

Levels: system, CPU, thread

Computational resources: CPU core(s), Memory, Interconnect

Software modules: Vectorize, OpenMP, MPI

Page 5: The HPC Challenge (HPCC) Benchmark Suite

HPCC: Hardware probes

[Figure: the memory hierarchy, annotated with bandwidth and latency.]

Corresponding memory hierarchy: Registers, Cache, Local memory, Remote memory, Disk

Units moved between levels: instruction operands, blocks, messages, pages

HPC Challenge Benchmark                              HPCS Performance Target (improvement)
Top500: solves a system, Ax = b                      2 Petaflops (8x)
STREAM: vector operations, A = B + s*C               6.5 Petabytes/s (40x)
FFT: 1D fast Fourier transform, Z = FFT(X)           0.5 Petaflops (200x)
RandomAccess: random updates, T(i) = XOR(T(i), r)    64,000 GUPS (2000x)

The HPCS program has developed a new suite of benchmarks (HPC Challenge).

Each benchmark focuses on a different part of the memory hierarchy.

HPCS program performance targets will flatten the memory hierarchy, improve real application performance, and make programming easier.
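The RandomAccess probe listed on this slide, T(i) = XOR(T(i), r), can be sketched in a few lines. This is an illustration under stated assumptions, not the benchmark's reference code: the generator in `next_random`, its feedback constant `POLY`, the table size, and the function names are all stand-ins chosen for the example. GUPS is taken here to mean updates per second divided by 10^9.

```python
import time

MASK64 = (1 << 64) - 1
POLY = 0x7  # feedback constant for the stand-in generator (an assumption)

def next_random(r):
    """Advance a 64-bit shift-and-XOR pseudo-random sequence."""
    high_bit = r >> 63
    r = (r << 1) & MASK64
    return r ^ POLY if high_bit else r

def apply_updates(table, updates, seed=1):
    """Apply T(i) = XOR(T(i), r) in place; return the measured rate in GUPS."""
    size = len(table)              # assumed to be a power of two
    r = seed
    start = time.perf_counter()
    for _ in range(updates):
        r = next_random(r)
        i = r & (size - 1)         # low bits of r select the table entry
        table[i] ^= r              # the random update itself
    elapsed = max(time.perf_counter() - start, 1e-12)
    return updates / elapsed / 1e9

if __name__ == "__main__":
    table = list(range(1 << 10))   # T(i) = i initially
    gups = apply_updates(table, 4096)
    print(f"{gups:.6f} GUPS (interpreter-bound, not a hardware figure)")
```

Because XOR is its own inverse, replaying the same update stream a second time restores the original table, which gives a cheap correctness check for a sketch like this one.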

Page 6: The HPC Challenge (HPCC) Benchmark Suite

HPCC: Official submission process

Prerequisites:
• C compiler
• BLAS
• MPI

1. Download

2. Install

3. Run

4. Upload results

5. Confirm via email

6. Tune

7. Run

8. Upload results

9. Confirm via email

Steps 6 to 9 are optional.

Tuning rules:
• Only some routines can be replaced.
• Data layout needs to be preserved.
• Multiple languages can be used.

Provide detailed installation and execution environment.

Results are immediately available on the Web site:
• Interactive HTML
• XML
• MS Excel
• Kiviat charts (radar plots)

Page 7: The HPC Challenge (HPCC) Benchmark Suite

HPCC: Submissions over time

[Figure: four log-scale panels of submitted results over time (Nov-03 to Oct-06), each with two series, "Sum" and "#1": STREAM [GB/s], HPL [Tflop/s], FFT [Gflop/s], RandomAccess [GUPS].]

Page 8: The HPC Challenge (HPCC) Benchmark Suite

HPCC: Comparing three interconnects

• Kiviat chart (radar plot)
• 3 AMD Opteron clusters
• Clock: 2.2 GHz
• 64-processor cluster

Interconnect types:
A. Vendor
B. Commodity
C. GigE

Cannot be differentiated based on:
• G-HPL
• Matrix-matrix multiply

Available on HPCC Web site http://icl.cs.utk.edu/hpcc/

Page 9: The HPC Challenge (HPCC) Benchmark Suite

HPCC: Analysis of sample results

[Figure: bar chart of effective bandwidth (words/second) on a log scale from Kilo (1.E+03) through Mega, Giga, and Tera to Peta (1.E+15), with four series per system: Top500 (words/s), STREAM (words/s), FFT (words/s), RandomAccess (words/s).]

Systems (in Top500 order): DARPA HPCS Goals, IBM BG/L (LLNL) Opt, IBM BG/L (LLNL), IBM Power5 (LLNL), Cray XT3 (ORNL), Cray XT3 (ERDC), Cray X1 (ORNL) Opt, Cray X1 (ORNL), NEC SX-8 (HLRS), SGI Altix (NASA), Cray X1E (AHPCRC), Opteron (AMD), Dell GigE P64 (MITLL), Dell GigE P32 (MITLL), Dell GigE P16 (MITLL), Dell GigE P8 (MITLL), Dell GigE P4 (MITLL), Dell GigE P2 (MITLL), Dell GigE P1 (MITLL)

• All results in words/second
• Highlights memory hierarchy
• Clusters (~10^6): hierarchy steepens
• HPC systems (~10^4): hierarchy constant
• HPCS goals (~10^2): hierarchy flattens, easier to program
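One way to read the per-class annotations on this slide is as the ratio between the fastest and slowest benchmark rates once everything is expressed in words/second. The sketch below applies that reading to the HPCS targets quoted on the "Hardware probes" slide. The conversion rules are assumptions made for the example, not taken from the slides: a word is 8 bytes, one flop or one update counts as one word, and the STREAM target is a per-second rate. The name `hierarchy_span` is likewise hypothetical.

```python
import math

BYTES_PER_WORD = 8  # assumption: 64-bit words

# HPCS targets from the "Hardware probes" slide, converted to words/second.
# Assumptions: one flop or one update counts as one word moved.
hpcs_targets_words_per_s = {
    "Top500": 2e15,                      # 2 Petaflops
    "STREAM": 6.5e15 / BYTES_PER_WORD,   # 6.5 Petabytes, read as a per-second rate
    "FFT": 0.5e15,                       # 0.5 Petaflops
    "RandomAccess": 64_000 * 1e9,        # 64,000 GUPS
}

def hierarchy_span(words_per_s):
    """Ratio between the fastest and slowest benchmark rates."""
    return max(words_per_s.values()) / min(words_per_s.values())

span = hierarchy_span(hpcs_targets_words_per_s)
print(f"span = {span:.2f}x, about 10^{math.log10(span):.1f}")
```

Under these assumptions the HPCS targets span between one and two orders of magnitude, which is broadly in line with the HPCS annotation on this slide; a steeper hierarchy would show up as a larger ratio.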

Page 10: The HPC Challenge (HPCC) Benchmark Suite

HPCC: Augmenting June TOP500

[Table: June TOP500 list. Column groups: TOP500 rating; data provided by HPCC database.]

Page 11: The HPC Challenge (HPCC) Benchmark Suite

Contacts

Piotr Luszczek
The MathWorks, Inc.
(508) [email protected]