
Bringing Users Along the Road to Billion Way Concurrency

Kathy Yelick
NERSC Director, Lawrence Berkeley National Laboratory
EECS Department, UC Berkeley

Cover Stories from NERSC Research

T. Head-Gordon 2010, Geddes 2009, V. Daggett 2010, E. Bylaska 2010, Dorland 2010, Sugiyama 2010, A. Aspden 2009, Balbuena 2009, Wang 2009, Bonoli 2009, Xantheas 2009, Mavrikakis 2009

NERSC is enabling new, high-quality science across disciplines, with over 1,600 refereed publications last year.


NERSC is the Primary Computing Center for DOE Office of Science

• NERSC serves a large population: over 3,000 users, 400 projects, 500 codes

• NERSC serves the DOE SC mission
  – Allocated by DOE program managers

  – Not limited to the largest-scale jobs

  – Not open to non-DOE applications

• Strategy: Science First
  – Requirements workshops by office
  – Procurements based on science codes
  – Partnerships with vendors to meet science requirements

(Pie chart: allocation by science area - Physics, Math + CS, Astrophysics, Chemistry, Climate, Combustion, Fusion, Lattice Gauge, Life Sciences, Materials, Other)

NERSC Systems for Science

Large-Scale Computing Systems

Franklin (NERSC-5): Cray XT4
• 38,128 cores (quad-core), ~25 Tflop/s on applications; 356 Tflop/s peak

Hopper (NERSC-6): Cray XE6
• Phase 1: Cray XT5, 668 nodes, 5,344 cores
• Phase 2: >1 Pflop/s peak (late 2010), 24-core nodes

NERSC Global Filesystem (NGF)
• Uses IBM's GPFS
• 1.5 PB; 5.5 GB/s

HPSS Archival Storage
• 40 PB capacity
• 4 tape libraries
• 150 TB disk cache

Clusters (105 Tflop/s combined)
• Carver: IBM iDataPlex cluster
• PDSF (HEP/NP): Linux cluster (~1K cores)
• Magellan cloud testbed: IBM iDataPlex cluster

Analytics
• Euclid (512 GB shared memory)
• Dirac GPU testbed (48 nodes, Fermi)


NERSC Roadmap

(Chart: peak Teraflop/s on a log scale, 2006-2020: Franklin (N5), 19 TF sustained / 101 TF peak; Franklin (N5) + quad-core, 36 TF sustained / 352 TF peak; Hopper (N6), >1 PF peak; NERSC-7, 10 PF peak; NERSC-8, 100 PF peak; NERSC-9, 1 EF peak)

Users expect 10x improvement in capability every 3-4 years

Challenges to Exascale

• System power is the primary constraint for exascale (MW ~= $M)

• Concurrency (1000x today) driven by system power and density

• Memory bandwidth and capacity are under pressure

• Processor architecture is an open question, but heterogeneity is likely

• Algorithms need to be designed to minimize data movement, not flops

• Programming models must handle memory and concurrency; a new chip-level model is needed

• Reliability and resiliency will be critical at this scale

• I/O bandwidth unlikely to keep pace with machine speed

Why are these NERSC problems?
• All are challenges for "capacity" 100 PF machines, except:
  – System-wide outages and bisection bandwidth
• NERSC needs to guide the transition of its user community
• NERSC represents the broad HPC market better than any other DOE center
• Hardware revolutions affect procurement strategies (Giga -> Tera)


NERSC Response to Exascale

• Vendor partnerships

• NERSC workload and procurements

• Testbeds & training

• Anticipating and changing the future
  – Programming models

  – Autotuning

– Algorithms

– Co-Design


Energy Efficiency Partnerships with Synapsense and IBM

• Monitoring for energy efficiency (and reliability!)

  – 600 sensors for temperature, etc.; rear-door heat exchangers

• Liquid cooling on the IBM system uses return water from another system, with a modified CDU design
  – Cuts cooling costs by as much as half

– Reduces floor space requirements by 30%

Air is colder coming out than going in!


Center of Excellence with Cray

What should we tell NERSC users to do?

Multicore Era: Massive on-chip concurrency necessary for reasonable power use

NERSC/Cray “Programming Models Center of Excellence” combines:

• Berkeley Lab strength in advanced programming models, multicore tuning, and application benchmarking

• Cray strength in advanced programming models, optimizing compilers, and benchmarking


Immediate question: What is the best way to use a Hopper node?
• Flat MPI: today's preferred mode of operation
• MPI + OpenMP, MPI + pthreads, MPI + PGAS, MPI + OpenCL, PGAS, …
(A minimal MPI + OpenMP hybrid sketch is shown below.)
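To make the hybrid option concrete, here is a minimal MPI + OpenMP sketch. It is an illustration, not code from the talk; the rank counts, loop bounds, and the FUNNELED threading level are placeholder assumptions. Each MPI rank spreads a local reduction across its OpenMP threads, and the ranks then combine results with a single MPI_Reduce, so there is one inter-node message per rank instead of one per core.

    /* Illustrative MPI + OpenMP hybrid (placeholder problem; compile with an MPI
       compiler wrapper and -fopenmp). */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int provided, rank, nranks;
        /* FUNNELED is enough when only the master thread makes MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        double local_sum = 0.0;
        /* Work within a rank is shared among OpenMP threads. */
        #pragma omp parallel for reduction(+:local_sum)
        for (int i = rank; i < 1000000; i += nranks)
            local_sum += 1.0 / (1.0 + (double)i);

        double global_sum = 0.0;
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum = %f (ranks = %d, threads/rank = %d)\n",
                   global_sum, nranks, omp_get_max_threads());
        MPI_Finalize();
        return 0;
    }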

Paratec MPI+OpenMP Performance

(Chart: wallclock, FFT, and DGEMM time in seconds vs. OpenMP threads / MPI tasks of 1/768, 2/384, 3/256, 6/128, and 12/64; lower is better)


fvCAM MPI+OpenMP Performance

(Chart: wallclock, dynamics, and physics time in seconds vs. OpenMP threads / MPI tasks of 1/240, 2/120, 3/80, 6/40, and 12/20; lower is better)

GTC MPI+OpenMP Performance

(Chart: pusher, shift, charge, poisson, and total time in seconds vs. OpenMP threads / MPI tasks of 1/1536, 2/768, 4/384, 6/256, and 12/128; lower is better)


Less Memory Usage with OpenMP Compared to Flat MPI

(Charts: memory per node in GB for GTC and fvCAM across the same OpenMP threads / MPI tasks sweeps; memory per node drops as more OpenMP threads are used per MPI task)

Current Procurement Strategy

Use Science Applications to Measure System Performance!

NERSC-6 "Sustained System Performance" (SSP) Benchmarks

• CAM: Climate
• GAMESS: Quantum Chemistry
• GTC: Fusion
• IMPACT-T: Accelerator Physics
• MAESTRO: Astrophysics
• MILC: Nuclear Physics
• PARATEC: Material Science

Peak FLOPs offer little insight into delivered application performance.
SSP applications model delivered performance of a REAL workload.

“Benchmarks are only useful insofar as they model the intended workload”

Ingrid Bucher (LANL 1978)


Algorithm Diversity

(Table: science areas vs. algorithm classes - dense linear algebra, sparse linear algebra, spectral methods (FFTs), particle methods, structured grids, and unstructured/AMR grids. Accelerator science, astrophysics, chemistry, climate, combustion, fusion, lattice gauge, and material science each rely on several of these classes; astrophysics uses all six.)

NERSC users require a system which performs adequately in all areas

Numerical Methods at NERSC (caveat: survey data from ERCAP requests)

(Chart: percentage of the 400 total projects using each numerical method, ranging from a few percent up to about 35%)


Revolutions Require a New Strategy

• Preserve application performance as the goal
  – But allow significant optimizations

  – Produced optimized versions of some codes

  – Understand performance early

• E.g., for a Fermi-based GPU system
  – Fermi is nearly as expensive as the host node

  – 48 nodes w/ Fermi, or 2x more nodes without

  – Minimum: Fermi must be 2x faster than the host

• Estimating the value of acceleration for SSP
  – Estimate "best possible performance"

  – Successive refinement of estimates

  – Stop if estimate < minimum threshold (2x host)

Dirac Testbed at NERSC

• Dirac is a 48-node GPU cluster

• 44 Fermi nodes
  – 3 GB memory per card

  – ~144 GB/s memory bandwidth + ECC

  – ~1 TF peak single precision and 500 GF/s double precision

  – 24 GB of memory per node

Paul Dirac, Nobel prize-winning Theoretical Physicist

• 4 Tesla nodes

• QDR InfiniBand network


RI-MP2: Molecular Equilibrium Structure

• Goal: optimize the Q-Chem RI-MP2 routine
  – Start with the "step 4" kernel, which dominates the running time

• Observations
  – Impressive single-node speedups
  – Performance results highly dependent on molecular structure

(Chart: measured speedups of 1.7x, 7.4x, 12.3x, and 18.8x on the Fermi GPU racks at NERSC)

Jihan Kim (SciDAC-E NERSC Postdoc)

GTC (Fusion) Charge Deposition Performance

Optimizing GTC (fusion code) charge deposition for multicore and GPUs

K. Ibrahim, K. Madduri, L. Oliker, S. Williams, E.-J. Im, E. Strohmaier, S. Ethier, J. Shalf

(Chart: Fermi vs. Tesla performance across configurations A2, A5, A10, A20, B2, B5)


Fermi Memory Bandwidth

Anticipating and Influencing the Future



Autotuning: Write Code Generators for Nodes

Nearest-neighbor 7-point stencil on a 3D grid: each point (x, y, z) is updated from itself and its six neighbors (x±1, y, z), (x, y±1, z), (x, y, z±1).

Use autotuning: write code generators and let computers do the tuning. (A minimal C version of the naive stencil is sketched below.)
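For reference, a naive, untuned 7-point stencil sweep in C might look like the following. This is an illustrative sketch, not the talk's code: the grid size N, the coefficients, and the point-source test in main are all arbitrary placeholders. An autotuner generates many variants of this loop nest (cache blocking, unrolling, prefetch, SIMD) and keeps the fastest.

    /* Minimal naive 7-point stencil (illustrative; N and coefficients are arbitrary). */
    #include <stdio.h>
    #include <stdlib.h>

    #define N 128
    #define IDX(i, j, k) ((size_t)(k) * N * N + (size_t)(j) * N + (size_t)(i))

    /* One sweep over the interior points of the grid. */
    void stencil7(const double *in, double *out) {
        for (int k = 1; k < N - 1; k++)
            for (int j = 1; j < N - 1; j++)
                for (int i = 1; i < N - 1; i++)
                    out[IDX(i, j, k)] =
                        0.4 * in[IDX(i, j, k)] +
                        0.1 * (in[IDX(i - 1, j, k)] + in[IDX(i + 1, j, k)] +
                               in[IDX(i, j - 1, k)] + in[IDX(i, j + 1, k)] +
                               in[IDX(i, j, k - 1)] + in[IDX(i, j, k + 1)]);
    }

    int main(void) {
        double *in  = calloc((size_t)N * N * N, sizeof(double));
        double *out = calloc((size_t)N * N * N, sizeof(double));
        in[IDX(N / 2, N / 2, N / 2)] = 1.0;          /* a single point source */
        stencil7(in, out);
        printf("value next to the source: %g\n", out[IDX(N / 2 + 1, N / 2, N / 2)]);
        free(in); free(out);
        return 0;
    }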

Example pattern-specific compiler: Structured grid in Ruby

• Ruby class encapsulates the structured-grid pattern; the body of the anonymous lambda specifies the filter function

    class LaplacianKernel < Kernel
      def kernel(in_grid, out_grid)
        in_grid.each_interior do |point|
          in_grid.neighbors(point,1).each do |x|
            out_grid[point] += 0.2*x.val
          end
        end
      end
    end

• Code generator produces OpenMP for multicore x86
  – ~1000-2000x faster than Ruby
  – Minimal per-call runtime overhead

    VALUE kern_par(int argc, VALUE* argv, VALUE self) {
      unpack_arrays into in_grid and out_grid;
      #pragma omp parallel for default(shared) private(t_6,t_7,t_8)
      for (t_8=1; t_8<256-1; t_8++) {
        for (t_7=1; t_7<256-1; t_7++) {
          for (t_6=1; t_6<256-1; t_6++) {
            int center = INDEX(t_6,t_7,t_8);
            out_grid[center] = (out_grid[center]
              + (0.2*in_grid[INDEX(t_6-1,t_7,t_8)]));
            ...
            out_grid[center] = (out_grid[center]
              + (0.2*in_grid[INDEX(t_6,t_7,t_8+1)]));


PGAS Languages: Why use 2 Languages (MPI+X) when 1 will do?

• Global address space: any thread may directly read/write remote data
• Partitioned: data is designated as local or global

(Figure: a global address space spanning threads p0, p1, …, pn; each thread holds private local data plus references to local (l:) and global (g:) objects)

• Remote put and get: never have to say "receive"
• No less scalable than MPI!
• Permits sharing, whereas MPI rules it out!
• Gives affinity control, useful on shared and distributed memory

Hybrid Partitioned Global Address Space

(Figure: four processors, each with a private local segment and a shared segment both in host memory and in GPU memory)

• Each thread has only two shared segments
• Decouple the memory model from the execution model: one thread per CPU, vs. one thread for all CPU and GPU "cores"
• Caveat: the type system, and therefore the interfaces, blow up with different parts of the address space


GASNet GPU Extension Performance

(Charts: latency and bandwidth of the GASNet GPU extension; lower latency and higher bandwidth are better)

Communication-Avoiding Algorithms

• Consider sparse iterative methods
  – Nearest-neighbor communication on a mesh
  – Dominated by the time to read the matrix (edges) from DRAM
  – And (small) communication and global synchronization events at each step

Can we lower data movement costs?
• Take k steps "at once" with one matrix read from DRAM and one communication phase
  – Parallel implementation: O(log p) messages vs. O(k log p)
  – Serial implementation: O(1) moves of data vs. O(k)

Joint work with Jim Demmel, Mark Hoemmen, Marghoob Mohiyuddin


• The "monomial" basis [Ax, …, A^k x] fails to converge
• A different polynomial basis does converge

Know your mathematics!
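A hedged note on why (this detail is not on the slide, and the specific replacement basis is an assumption): the monomial vectors x, Ax, ..., A^k x increasingly line up with the dominant eigenvector of A and become nearly linearly dependent in finite precision, so communication-avoiding Krylov solvers typically generate the basis with a better-conditioned polynomial recurrence, for example a shifted Newton basis:

    \[
    \text{monomial: } V = [\,x,\; Ax,\; A^{2}x,\; \ldots,\; A^{k}x\,]
    \qquad\text{vs.}\qquad
    \text{Newton: } v_{0} = x,\quad v_{j} = (A - \theta_{j} I)\,v_{j-1},\quad j = 1,\ldots,k
    \]

where the shifts \theta_j are chosen (e.g., from Ritz values of A) to keep the basis well conditioned.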

Communication-Avoiding GMRES on 8-core Clovertown


Co-Design Before its Time

• Demonstrated during SC '09
• CSU atmospheric model ported to a low-power core design
  – Dual-core Tensilica processors running the atmospheric model at 25 MHz
  – MPI routines ported to the custom Tensilica interconnect

• Memory and processor stats available for performance analysis

• Emulation performance advantage
  – 250x speedup over a merely functional software simulator

• Actual code running, not a representative benchmark

(Figure: icosahedral mesh used for algorithm scaling)

General Lessons

• Early intervention with hardware designs

• Optimize for what is important: energy, data movement

• Anticipating and changing the future
  – Programming models

  – Autotuning

  – Algorithms

  – Co-Design


NERSC Aggressive Roadmap

(Chart: peak Teraflop/s on a log scale, 2006-2020, same systems as the roadmap above: Franklin (N5), 19 TF sustained / 101 TF peak; Franklin (N5) + quad-core, 36 TF sustained / 352 TF peak; Hopper (N6), >1 PF peak; NERSC-7, 10 PF peak; NERSC-8, 100 PF peak; NERSC-9, 1 EF peak. The chart is annotated with the expected programming models along the way: COTS/MPP + MPI; COTS/MPP + MPI (+ OpenMP); GPU CUDA/OpenCL or manycore BG/Q, R; Exascale + ???)

• Users expect 10x improvement in capability every 3-4 years

• Three continents, three institutions: UC Berkeley/LBNL/NERSC, University of Heidelberg, and National Astronomical Observatories (CAS)
• People: Horst Simon, Hemant Shukla, John Shalf, Rainer Spurzem

• ICCS Projects: GRACE II, SILK ROAD

• ICCS Activities
  – Summer School, Aug 2-6, 2010: "Proven Algorithmic Techniques for Many-core Processors," Wen-Mei Hwu (UIUC) and David Kirk (NVIDIA)
  – Workshop, Nov 30 - Dec 2, 2009: "Many-core and Accelerator-based Computing for Physics and Astronomy Applications"

• ISAAC: Infrastructure for Astrophysics Applications Computing
  – A three-year (2010-2013) NSF-funded project focused on research and development of infrastructure for accelerating physics and astronomy applications using many-core and multicore architectures
  – Goal: harness the power of parallel architectures for compute-intensive scientific problems and open doors for new discovery, revolutionizing the growth of science via simulations, instrumentation, and data processing/analysis

• Visit us: http://iccs.lbl.gov


Performance Has Also Slowed, Along with Power

Moore's Law continues with core doubling.

(Chart, 1970-2010: transistors (thousands), clock frequency (MHz), power (W), performance, and number of cores; transistor counts keep doubling while frequency, power, and single-core performance have flattened since about 2005)

Data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović

Memory is Not Keeping Pace

• Technology trends work against constant or increasing memory per core
  – Memory density is doubling every three years; processor logic every two
  – Storage costs (dollars/Mbyte) are dropping gradually compared to logic costs

(Charts: memory density trends (source: David Turek, IBM) and cost of computation vs. memory (source: IBM))

• Question: Can you double concurrency without doubling memory?


How to make use of 100,000 (or more!) cores?


7-Point Stencil Revisited

• Cell and GTX280 are notable for both performance and energy efficiency

•Joint work with Kaushik Datta, Jonathan Carter, Shoaib Kamil, Lenny Oliker, John Shalf, and Sam Williams


#2: Understand your machine limits

The “roofline” model

S. Williams, D. Patterson, L. Oliker, J. Shalf, K. Yelick


The Roofline Performance Model

• The top of the roof is determined by the peak computation rate (double-precision floating point, DP, for these algorithms)
• The instruction mix, lack of SIMD operations, lack of ILP, or failure to use other features of peak will lower the attainable rate

(Chart: attainable Gflop/s vs. actual flop:byte ratio for a generic machine, with horizontal ceilings at peak DP and, below it, mul/add imbalance, without SIMD, and without ILP)


The Roofline Performance Model

• The sloped part of the roof is determined by peak DRAM bandwidth (as measured by STREAM)
• Lack of proper prefetch, ignoring NUMA, or other issues will reduce the attainable bandwidth

(Chart: the same roofline for the generic machine, highlighting the bandwidth-limited sloped region)

The Roofline Performance Model

• The locations of the "posts" in the building are determined by algorithmic intensity
• Intensity will vary across algorithms and with bandwidth-reducing optimizations, such as better cache re-use (tiling) and compression techniques

(Chart: the same roofline; vertical lines mark the flop:byte ratios of different kernels)
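The three slides above amount to one formula: attainable Gflop/s = min(peak Gflop/s, peak GB/s x arithmetic intensity). The tiny C illustration below evaluates that bound; the peak compute and bandwidth numbers are invented placeholders for the "generic machine", not measurements of any NERSC system.

    /* Roofline bound: min(peak compute, peak bandwidth * flops-per-byte). */
    #include <stdio.h>

    static double roofline(double peak_gflops, double peak_gbps, double flops_per_byte) {
        double bw_bound = peak_gbps * flops_per_byte;          /* sloped part of the roof */
        return bw_bound < peak_gflops ? bw_bound : peak_gflops; /* flat part caps it */
    }

    int main(void) {
        double peak_gflops = 74.0, peak_gbps = 18.0;           /* assumed machine parameters */
        for (double ai = 0.125; ai <= 16.0; ai *= 2.0)         /* flop:byte ratio from 1/8 to 16 */
            printf("intensity %6.3f -> %6.1f Gflop/s attainable\n",
                   ai, roofline(peak_gflops, peak_gbps, ai));
        return 0;
    }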


Roofline model for Stencil (out-of-the-box code)

• Large datasets
• 2 unit-stride streams
• No NUMA, little ILP, no DLP
• Far more adds than multiplies (imbalance)
• Ideal flop:byte ratio of 1/3
• High locality requirements
• Capacity and conflict misses will severely impair the flop:byte ratio
• No naive SPE implementation on Cell

(Charts: roofline plots of attainable Gflop/s vs. flop:DRAM byte ratio for Intel Xeon E5345 (Clovertown), AMD Opteron 2356 (Barcelona), Sun T2+ T5140 (Victoria Falls), and IBM QS20 Cell Blade, with ceilings for mul/add imbalance, without SIMD, without ILP, without FMA, 25% FP, and 12% FP)



Roofline model for Stencil (NUMA, cache blocking, unrolling, prefetch, …)

• Cache blocking helps ensure the flop:byte ratio is as close as possible to 1/3
• Clovertown has huge caches but is pinned to a lower bandwidth ceiling
• Cache management is essential when capacity per thread is low

(Charts: roofline plots for the same four machines with the partially tuned stencil)

Roofline model for Stencil (SIMDization + cache bypass)

• Make SIMDization explicit
• Use the cache-bypass instruction movntpd
• Increases the flop:byte ratio to ~0.5 on x86/Cell

(Charts: roofline plots for the same four machines with the fully tuned stencil)


#3) Write Code Generators Rather Than Code (LBMHD)

• LBMHD is not always bandwidth limited: used SIMD, etc.

(Charts: LBMHD performance on Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), and IBM Cell Blade, stacked by optimization: Naive + NUMA, +Padding, +Vectorization, +Unrolling, +SW Prefetching, +SIMDization)

•Joint work with Sam Williams, Lenny Oliker, John Shalf, and Jonathan Carter

#4) Used Optimized Libraries (Sparse Matrix-Vector Multiplication)

• Sparse matrix
  – Most entries are 0.0
  – Performance advantage in only storing/operating on the nonzeros
  – Requires significant metadata

• Evaluate y = Ax
  – A is a sparse matrix; x and y are dense vectors

• Challenges
  – Difficult to exploit ILP (bad for superscalar)
  – Difficult to exploit DLP (bad for SIMD)
  – Irregular memory access to the source vector
  – Difficult to load balance
  – Very low computational intensity (often >6 bytes/flop) = likely memory bound

(Figure: y = Ax with example matrices: Protein, FEM/Spheres, FEM/Cantilever, FEM/Accelerator, Circuit, webbase)

(A minimal CSR SpMV kernel in C is sketched below.)
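For reference, a minimal CSR (compressed sparse row) SpMV in C is shown below. It is an illustrative sketch, not NERSC or OSKI code, and the tiny example matrix in main is made up.

    /* Minimal CSR sparse matrix-vector multiply, y = A*x (illustrative sketch). */
    #include <stdio.h>

    /* CSR storage: row_ptr[i] .. row_ptr[i+1]-1 index the nonzeros of row i. */
    void spmv_csr(int nrows, const int *row_ptr, const int *col_idx,
                  const double *val, const double *x, double *y) {
        for (int i = 0; i < nrows; i++) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                sum += val[k] * x[col_idx[k]];   /* irregular access into x */
            y[i] = sum;
        }
    }

    int main(void) {
        /* 3x3 example matrix: [2 0 1; 0 3 0; 4 0 5] */
        int row_ptr[] = {0, 2, 3, 5};
        int col_idx[] = {0, 2, 1, 0, 2};
        double val[]  = {2, 1, 3, 4, 5};
        double x[] = {1, 1, 1}, y[3];
        spmv_csr(3, row_ptr, col_idx, val, x, y);
        printf("y = %g %g %g\n", y[0], y[1], y[2]);   /* expected: 3 3 9 */
        return 0;
    }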


Extra Work Can Improve Efficiency!

• Example: 3x3 blocking
  – Logical grid of 3x3 cells

  – Fill in explicit zeros

  – Use only 1 index per block, rather than per nonzero

  – Unroll 3x3 block multiplies

  – "Fill ratio" = 1.5

• On Pentium III: 1.5x speedup!
  – The actual Mflop rate is 2.25x higher (1.5x more flops performed in 1/1.5 the time)

Naïve Parallel Implementation

• SPMD style

• Partition by rows

• Load balance by nonzeros

• Niagara2 is ~2.5x the x86 machines

(Charts: naive and naive-Pthreads SpMV performance on Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), and IBM Cell Blade (PPEs))


Naïve Parallel Implementation (continued)

• On the two x86 machines: 8x cores = 1.9x performance and 4x cores = 1.5x performance
• Niagara2: 64x threads = 41x performance; Cell PPEs: 4x threads = 3.4x performance
• Niagara2 is ~2.5x the x86 machines

(Charts: the same naive and naive-Pthreads SpMV results, annotated with parallel scaling)

Naïve Parallel Implementation (continued)

• On the two x86 machines: 1.4% of peak flops / 29% of bandwidth and 4% of peak flops / 20% of bandwidth
• Niagara2: 25% of peak flops, 39% of bandwidth
• Cell PPEs: 2.7% of peak flops, 4% of bandwidth

(Charts: the same naive and naive-Pthreads SpMV results, annotated with fraction of peak flops and bandwidth)


Autotuned Performance (+DIMMs, Firmware, Padding)

• Clovertown was already fully populated with DIMMs
• Gave the Opteron as many DIMMs as the Clovertown
• Firmware update for Niagara2
• Array padding to avoid inter-thread conflict misses
• PPEs use ~1/3 of the Cell chip area

(Charts: autotuned SpMV on Clovertown, Opteron, Niagara2, and Cell PPEs, stacked by optimization: Naive, Naive Pthreads, +NUMA/Affinity, +SW Prefetching, +Compression, +Cache/TLB Blocking, +More DIMMs (Opteron) / FW fix and array padding (N2), etc.)

Autotuned Performance (+Cell/SPE version)

• Wrote a double-precision Cell/SPE version
• DMA, local-store blocked, NUMA aware, etc.
• Only 2x1 and larger BCOO
• Only the SpMV-proper routine changed
• About 12x faster (median) than using the PPEs alone

(Charts: autotuned SpMV on Clovertown, Opteron, Niagara2, and Cell Blade (SPEs), stacked by optimization as before)


Autotuned Performance (+Cell/SPE version, continued)

• On the two x86 machines: 4% of peak flops / 52% of bandwidth and 20% of peak flops / 65% of bandwidth
• Niagara2: 54% of peak flops, 57% of bandwidth
• Cell SPEs: 40% of peak flops, 92% of bandwidth

(Charts: the same autotuned SpMV results, annotated with fraction of peak flops and bandwidth)

MPI vs. Threads

• On x86 machines, the autotuned (OSKI) shared-memory MPICH implementation rarely scales beyond 2 threads
• Still debugging MPI issues on Niagara2, but so far it rarely scales beyond 8 threads

(Charts: naive serial, autotuned MPI, and autotuned pthreads SpMV on AMD Opteron, Intel Clovertown, and Sun Niagara2 (Huron))


Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (Ax and A^T y), TrSV
  – Hides the complexity of run-time tuning
  – Includes new, faster locality-aware kernels: A^T A x, A^k x

• Faster than standard implementations
  – Up to 4x faster matvec, 1.8x trisolve, 4x A^T A x

• For "advanced" users & solver library writers
  – Available as a stand-alone library (OSKI 1.0.1h, 6/07)
  – Available as a PETSc extension (OSKI-PETSc .1d, 3/06)
  – bebop.cs.berkeley.edu/oski

Programming Model for Multicore

• These autotuned implementations use:
  – A fixed set of threads (pthreads): "parallel all the time"
  – Shared memory: avoid unnecessary replication
  – Logically partitioned memory with affinity: avoid unnecessary cache-coherence traffic
(A small pthreads sketch of this style follows below.)
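As a hedged illustration of that style (not code from the talk), the sketch below creates a fixed pool of pthreads, gives each thread its own partition of a shared array, and has each thread touch its partition first, so that on a NUMA node with a first-touch policy the pages land near the thread that will use them. NTHREADS, N, and the work done are placeholders.

    /* Fixed thread pool with logically partitioned shared data (illustrative sketch). */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NTHREADS 4
    #define N (1 << 20)

    static double *data;                 /* shared array, logically partitioned */

    static void *worker(void *arg) {
        long t = (long)arg;
        long lo = t * (N / NTHREADS), hi = lo + N / NTHREADS;
        /* First touch: each thread initializes its own partition, so the pages are
           allocated near this thread under a first-touch NUMA policy. */
        for (long i = lo; i < hi; i++) data[i] = (double)i;
        /* Each thread then works only on its own partition. */
        double sum = 0.0;
        for (long i = lo; i < hi; i++) sum += data[i];
        printf("thread %ld: partial sum %g\n", t, sum);
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        data = malloc(N * sizeof(double));
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, worker, (void *)t);
        for (long t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
        free(data);
        return 0;
    }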

• What programming model offers these?



PGAS Languages: Why use 2 Languages (MPI+X) when 1 will do?

• Global address space: any thread may directly read/write remote data
• Partitioned: data is designated as local or global

(Figure: the same global-address-space diagram as before, spanning threads p0, p1, …, pn)

• Remote put and get: never have to say "receive"
• Remote function invocation? See the HPCS languages
• No less scalable than MPI!
• Permits sharing, whereas MPI rules it out!
• One model rather than two; but if you insist on two:
  – Can call UPC from MPI and vice versa (tested and used)

#5) Used Lightweight Communication

Use a programming model in which you can utilize bandwidth and "low" latency.

(Charts: 8-byte round-trip latency in microseconds and flood bandwidth for 4 KB messages as a percentage of hardware peak, comparing MPI ping-pong against GASNet put+sync on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Federation; GASNet generally achieves lower latency and a higher fraction of peak bandwidth)

Joint work with Berkeley UPC Group


Two-sided vs One-sided Communication

(Figure: a two-sided message carries a message id and data payload and must be matched by the receiving host CPU; a one-sided put message carries the destination address and data payload and is deposited directly into memory by the network interface)

• Two-sided message passing (e.g., MPI) requires matching a send with a receive to identify the memory address where the data should be put
  – Wildly popular in HPC, but cumbersome in some applications
  – Couples data transfer with synchronization

• Using a global address space decouples synchronization
  – Pay for what you need!
  – Note: global addressing ≠ cache-coherent shared memory
(A small MPI sketch contrasting the two styles follows below.)
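As a hedged illustration in plain MPI (GASNet/UPC is what the work described here actually uses), the sketch below contrasts a matched send/receive with a one-sided MPI_Put into an exposed window; the buffer sizes and ranks are placeholders, and it needs at least two ranks to run.

    /* Two-sided vs. one-sided data movement in MPI (illustrative sketch). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        double buf[4] = {0}, remote[4] = {0};
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Two-sided: the receiver must post a matching receive. */
        if (rank == 0) {
            for (int i = 0; i < 4; i++) buf[i] = i;
            MPI_Send(buf, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(remote, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        /* One-sided: rank 0 puts directly into rank 1's exposed window;
           rank 1 never says "receive", only the fences synchronize. */
        MPI_Win win;
        MPI_Win_create(remote, 4 * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);
        MPI_Win_fence(0, win);
        if (rank == 0)
            MPI_Put(buf, 4, MPI_DOUBLE, 1, 0, 4, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);
        if (rank == 1)
            printf("rank 1 received %g %g %g %g\n",
                   remote[0], remote[1], remote[2], remote[3]);
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }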

Joint work with Dan Bonachea, Paul Hargrove, Rajesh Nishtala and rest of UPC group

Case Study Update: NAS FT

• Perform a large 3D FFT

– Represents bisection-bandwidth limited algorithms

• Builds on our previous work, but with a 2D partition

– Requires two rounds of communication rather than one

  – Each processor communicates with O(√T) threads

• Leverage nonblocking communication
  – Packed: minimize messages, no overlap

– Slab: no packing, one message per plane per “other” thread
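The slab variant relies on overlapping communication with computation. A generic MPI nonblocking sketch of that pattern is shown below; it is not the UPC/GASNet code used in the study, and the plane size, plane count, and neighbor rank are placeholders.

    /* Overlapping communication with computation via nonblocking MPI (illustrative sketch). */
    #include <mpi.h>
    #include <stdlib.h>

    #define PLANE   4096   /* placeholder: elements per plane */
    #define NPLANES 8      /* placeholder: planes to exchange */

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int peer = (rank + 1) % size;       /* placeholder communication partner */

        double *send = malloc(NPLANES * PLANE * sizeof(double));
        double *recv = malloc(NPLANES * PLANE * sizeof(double));
        MPI_Request reqs[2 * NPLANES];

        /* Post all receives up front so incoming planes can arrive at any time. */
        for (int p = 0; p < NPLANES; p++)
            MPI_Irecv(recv + p * PLANE, PLANE, MPI_DOUBLE, peer, p,
                      MPI_COMM_WORLD, &reqs[p]);

        for (int p = 0; p < NPLANES; p++) {
            for (int i = 0; i < PLANE; i++)  /* "compute" plane p */
                send[p * PLANE + i] = rank + p + i * 1e-6;
            /* Send plane p without waiting; later planes are computed while it is in flight. */
            MPI_Isend(send + p * PLANE, PLANE, MPI_DOUBLE, peer, p,
                      MPI_COMM_WORLD, &reqs[NPLANES + p]);
        }
        MPI_Waitall(2 * NPLANES, reqs, MPI_STATUSES_IGNORE);

        free(send); free(recv);
        MPI_Finalize();
        return 0;
    }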



FFT Performance on BlueGene/P
(HPC Challenge peak as of July '09 is ~4.5 Tflop/s on 128k cores)

• PGAS implementations consistently outperform MPI
• Leveraging communication/computation overlap yields the best performance
• More collectives in flight and more communication leads to better performance
• At 32k cores, overlap algorithms yield a 17% improvement in overall application time
• Numbers are getting close to the HPC record; future work will try to beat the record

(Chart: GFlop/s vs. number of cores from 256 to 32,768 for Slabs, Slabs (Collective), Packed Slabs (Collective), and MPI Packed Slabs; higher is better)

FFT Performance on Cray XT4

• 1,024 cores of the Cray XT4
  – Uses FFTW for local FFTs
  – The larger the problem size, the more effective the overlap

(Chart: performance comparison; higher is better)


#6) Avoid Synchronization

Computations as DAGs: view parallel executions as the directed acyclic graph of the computation

(Figures: DAGs for 4x4 Cholesky and 4x4 QR factorizations)

Slide source: Jack Dongarra

Parallel LU Factorization

(Figure: blocks are 2D block-cyclic distributed; the completed part of U lies above and the completed part of L to the left; panel factorizations involve communication for pivoting; the trailing matrix A(i,j), A(i,k), A(j,i), A(j,k) is updated with matrix-matrix multiplication, which can be coalesced)


Event Driven Execution of Dense LU

• Ordering needs to be imposed on the schedule
• Critical operation: panel factorization
  – Need to satisfy its dependencies first
  – Perform trailing-matrix updates with low block numbers first
  – "Memory constrained" lookahead

• General issue: dynamic scheduling in partitioned memory
  – Can deadlock the memory allocator!

(Figure: LU task DAG, some edges omitted)

DAG Scheduling Outperforms Bulk-Synchronous Style

• UPC vs. PLASMA on shared memory; UPC vs. ScaLAPACK on partitioned memory

(Chart: GFlop/s for ScaLAPACK vs. UPC on 2x4 and 4x4 processor grids; UPC is faster)

• UPC LU factorization code adds cooperative (non-preemptive) threads for latency hiding
  – New problem in partitioned memory: allocator deadlock
  – Can run out of memory locally due to an unlucky execution order

PLASMA by Dongarra et al.; UPC LU joint work with Parry Husbands


#7) Use Scalable Algorithms

• Algorithmic gains in the last decade have far outstripped Moore's Law
  – Adaptive meshes rather than uniform
  – Sparse matrices rather than dense
  – Reformulation of the problem back to basics

• Example: the canonical "Poisson" problem on n points
  – Dense LU: most general, but O(n^3) flops on O(n^2) data
  – Multigrid: fastest/smallest, O(n) flops on O(n) data

Performance results: John Bell et al

Conclusions

• Use your communication systems effectively
  – On-chip communication between cores

– Communication between processor and memory

– Communication between sockets

– Communication between nodes

• Work with experts on hardware, software, algorithms, applications



More Info

• The Berkeley View / ParLab
  – http://view.eecs.berkeley.edu
  – http://parlab.eecs.berkeley.edu/

• Berkeley Autotuning and PGAS projects
  – http://bebop.cs.berkeley.edu
  – http://upc.lbl.gov
  – http://titanium.cs.berkeley.edu

• NERSC System Architecture Group
  – http://www.nersc.gov/projects/SDSA

• LBNL Future Technologies Group
  – http://crd.lbl.gov/ftg