Tuning Stencils


Transcript of Tuning Stencils

Page 1: Tuning Stencils

Parallel Computing Laboratory
EECS Electrical Engineering and Computer Sciences
BERKELEY PAR LAB

Tuning Stencils

Kaushik Datta

Microsoft Site Visit

April 29, 2008

Page 2: Tuning Stencils


Stencil Code Overview

For a given point, a stencil is a pre-determined set of nearest neighbors (possibly including itself)

A stencil code updates every point in a regular grid with a constant weighted subset of its neighbors (“applying a stencil”)

[Figures: a 2D stencil and a 3D stencil]

Page 3: Tuning Stencils


Stencil Applications

Stencils are critical to many scientific applications: Diffusion, Electromagnetics, Computational Fluid Dynamics
  Both uniform and adaptive block-structured meshes

Many types of stencils:
  1D, 2D, 3D meshes
  Number of neighbors (5-pt, 7-pt, 9-pt, 27-pt, …)
  Gauss-Seidel (update in place) vs. Jacobi iterations (2 meshes)

Varying boundary conditions (constant vs. periodic)

Page 4: Tuning Stencils


Naïve Stencil Code

void stencil3d(double A[], double B[], int nx, int ny, int nz) {
  for all grid indices in x-dim {
    for all grid indices in y-dim {
      for all grid indices in z-dim {
        B[center] = S0 * A[center] +
                    S1 * (A[top] + A[bottom] + A[left] + A[right] +
                          A[front] + A[back]);
      }
    }
  }
}
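For reference, a compilable version of this naive kernel, written as a sketch: the IDX indexing macro and the weights S0/S1 are illustrative assumptions (z is the unit-stride dimension, as in the slides). Later sketches in this transcript reuse IDX, S0, and S1 from this block.

#include <stddef.h>

/* Illustrative weights; the real values depend on the PDE being solved. */
#define S0 (-6.0)
#define S1 ( 1.0)

/* z is the unit-stride dimension, matching the slides' layout. */
#define IDX(i, j, k, ny, nz) (((size_t)(i) * (ny) + (j)) * (nz) + (k))

void stencil3d(const double *A, double *B, int nx, int ny, int nz)
{
    for (int i = 1; i < nx - 1; i++)
        for (int j = 1; j < ny - 1; j++)
            for (int k = 1; k < nz - 1; k++)
                B[IDX(i, j, k, ny, nz)] =
                    S0 *  A[IDX(i,     j,     k,     ny, nz)] +
                    S1 * (A[IDX(i - 1, j,     k,     ny, nz)] +
                          A[IDX(i + 1, j,     k,     ny, nz)] +
                          A[IDX(i,     j - 1, k,     ny, nz)] +
                          A[IDX(i,     j + 1, k,     ny, nz)] +
                          A[IDX(i,     j,     k - 1, ny, nz)] +
                          A[IDX(i,     j,     k + 1, ny, nz)]);
}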

Page 5: Tuning Stencils


Our Stencil Code

Executes a 3D, 7-point, Jacobi iteration on a 256³ grid
Performs 8 flops (6 adds, 2 mults) per point
Parallelization performed with pthreads
  Thread affinity: multithreading, then multicore, then multisocket
Flop:Byte Ratio:
  0.33 (write-allocate architectures)
  0.5 (ideal)
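A sketch of where these two ratios come from, assuming 8-byte doubles, one read of A and one write of B per point, and a write-allocate cache that must fill B's line on the write miss:

  8 flops / (8 B read of A + 8 B write of B + 8 B write-allocate fill of B) = 8/24 ≈ 0.33
  8 flops / (8 B read of A + 8 B write of B)                                = 8/16 = 0.5 (ideal, no write-allocate traffic)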

Page 6: Tuning Stencils


Cache-Based Architectures

Intel Clovertown

Sun Victoria Falls

AMD Barcelona

[Block diagrams of the three machines; the AMD diagram shows two sockets of four Opteron cores, each core with a 512KB victim cache, a 2MB shared quasi-victim L3 (32-way) and SRI/crossbar per socket, 2x64b memory controllers, and 667MHz DDR2 DIMMs at 10.66 GB/s]

Page 7: Tuning Stencils


Autotuning

Provides a portable and effective method for tuning
Limiting the search space:
  Searching the entire space is intractable
  Instead, we ordered the optimizations appropriately for a given platform
  To find the best parameters for a given optimization, we performed an exhaustive search
  Each optimization was applied on top of all previous optimizations
  In general, can also use heuristics/models to prune the search space
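A sketch of this layered search, with hypothetical names throughout: run_stencil() stands in for "build, run, and time one configuration", and config_t's fields are stand-ins for the real tuning parameters.

#include <stdio.h>

typedef struct { int unroll, pad, cblock_y, prefetch_dist; } config_t;

/* Placeholder: the real autotuner runs and times the actual kernel. */
static double run_stencil(config_t c) { return 1.0 + 0.1 * c.unroll; }

int main(void)
{
    config_t best = {1, 0, 0, 0};              /* start from the naive settings */
    double best_rate = run_stencil(best);

    /* One layer of the search: exhaustively sweep a single optimization's
       parameters on top of the best configuration found so far.            */
    for (int u = 1; u <= 8; u *= 2) {          /* e.g. loop-unrolling depth  */
        config_t trial = best;
        trial.unroll = u;
        double rate = run_stencil(trial);
        if (rate > best_rate) { best_rate = rate; best = trial; }
    }
    /* ...repeat the same sweep for padding, cache blocking, prefetch
       distance, etc., in a platform-appropriate order.                     */

    printf("best unroll = %d (%.2f GFlop/s)\n", best.unroll, best_rate);
    return 0;
}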

Page 8: Tuning Stencils


Naive Code

Naïve code is a simple, threaded stencil kernel
Domain partitioning performed only in the least contiguous dimension
No optimizations or tuning was performed

[Figure: the grid with axes x, y, and z (unit-stride)]

Page 9: Tuning Stencils


Naïve

Intel Clovertown | AMD Barcelona | Sun Victoria Falls

[Charts: GFlop/s (scale 0-9) versus concurrency (1, 2, 4, 8; up to 16 on Victoria Falls); series: Naive]

Page 10: Tuning Stencils


NUMA-Aware

Intel Clovertown

Sun Victoria Falls

AMD Barcelona

[The same machine block diagrams as on the Cache-Based Architectures slide]

Exploited the “first-touch” page mapping policy on NUMA architectures
Due to our affinity policy, the benefit is only seen when using both sockets
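A sketch of how first touch is exploited, assuming Linux's default policy (a page lands on the socket of the thread that first writes it): each pthread initializes exactly the slab of A and B it will later update, so its data ends up resident on its own socket. Thread-to-core pinning (the affinity policy above) is assumed to be set up separately; NTHREADS and the slab partitioning are illustrative.

#include <pthread.h>
#include <stdlib.h>

#define NTHREADS 8
#define N (256L * 256L * 256L)      /* total points in the 256^3 grid */

static double *A, *B;

static void *first_touch(void *arg)
{
    long t = (long)arg;
    long lo = t * (N / NTHREADS), hi = (t + 1) * (N / NTHREADS);
    for (long i = lo; i < hi; i++) { A[i] = 0.0; B[i] = 0.0; }
    return NULL;    /* later, the same thread updates the same index range */
}

int main(void)
{
    pthread_t th[NTHREADS];
    A = malloc(N * sizeof *A);
    B = malloc(N * sizeof *B);

    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&th[t], NULL, first_touch, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(th[t], NULL);

    /* ... run the stencil sweeps with the same thread-to-slab mapping ... */

    free(A); free(B);
    return 0;
}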

Page 11: Tuning Stencils


NUMA-Aware

Intel Clovertown | AMD Barcelona | Sun Victoria Falls

[Charts: GFlop/s (scale 0-9) versus concurrency (1, 2, 4, 8; up to 16 on Victoria Falls); series: Naive + NUMA-Aware]

Page 12: Tuning Stencils


Loop Unrolling/Reordering

Allows for better use of registers and functional units
Best inner loop chosen by iterating many times over a grid size that fits into the L1 cache (x86 machines) or L2 cache (VF)
  This should eliminate any effects from the memory subsystem
This optimization is independent of the later memory optimizations
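As an illustration, one (i, j) pencil of the kernel unrolled by two in the unit-stride z dimension, so the compiler can keep the shared center values in registers and schedule independent adds and multiplies together. This is a sketch: it reuses IDX, S0, and S1 from the naive block above, assumes nz is even (e.g. 256), and the unrolling depth itself is one of the tuned parameters.

/* Inner pencil of stencil3d, unrolled by 2 in z. */
static void pencil_unroll2(const double *A, double *B,
                           int i, int j, int ny, int nz)
{
    for (int k = 1; k < nz - 1; k += 2) {
        double c0 = A[IDX(i, j, k,     ny, nz)];   /* reused as k+1's back neighbor */
        double c1 = A[IDX(i, j, k + 1, ny, nz)];   /* reused as k's front neighbor  */

        B[IDX(i, j, k, ny, nz)] = S0 * c0 +
            S1 * (A[IDX(i - 1, j, k, ny, nz)] + A[IDX(i + 1, j, k, ny, nz)] +
                  A[IDX(i, j - 1, k, ny, nz)] + A[IDX(i, j + 1, k, ny, nz)] +
                  A[IDX(i, j, k - 1, ny, nz)] + c1);

        B[IDX(i, j, k + 1, ny, nz)] = S0 * c1 +
            S1 * (A[IDX(i - 1, j, k + 1, ny, nz)] + A[IDX(i + 1, j, k + 1, ny, nz)] +
                  A[IDX(i, j - 1, k + 1, ny, nz)] + A[IDX(i, j + 1, k + 1, ny, nz)] +
                  c0 + A[IDX(i, j, k + 2, ny, nz)]);
    }
}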

Page 13: Tuning Stencils


Loop Unrolling/Reordering

Intel Clovertown | AMD Barcelona | Sun Victoria Falls

[Charts: GFlop/s (scale 0-9) versus concurrency (1, 2, 4, 8; up to 16 on Victoria Falls); series: Naive + NUMA-Aware + Loop Unrolling/Reordering]

Page 14: Tuning Stencils


Padding

[Figure: the grid with axes x, y, z (unit-stride); the padding amount is added in the unit-stride dimension]

Used to reduce conflict misses and DRAM bank conflicts
Drawback: larger memory footprint
Performed a search to determine the best padding amount
Only padded in the unit-stride dimension
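A sketch of what this padding looks like in code, assuming only the unit-stride (z) dimension is padded; NZ_PAD is a hypothetical value of the padding amount that the search would choose.

#include <stdlib.h>

#define NX 256
#define NY 256
#define NZ 256
#define NZ_PAD 16    /* illustrative; the autotuner sweeps this value */

/* Index into the padded array: successive (i, j) pencils are now offset by
   NZ + NZ_PAD doubles, shifting them onto different cache sets / DRAM banks. */
#define IDXP(i, j, k) (((size_t)(i) * NY + (j)) * (NZ + NZ_PAD) + (k))

double *alloc_padded_grid(void)
{
    /* the footprint grows by NZ_PAD doubles per (i, j) pencil */
    return malloc((size_t)NX * NY * (NZ + NZ_PAD) * sizeof(double));
}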

Page 15: Tuning Stencils


Padding

Intel Clovertown | AMD Barcelona | Sun Victoria Falls

[Charts: GFlop/s (scale 0-9) versus concurrency (1, 2, 4, 8; up to 16 on Victoria Falls); series: Naive + NUMA-Aware + Loop Unrolling/Reordering + Padding]

Page 16: Tuning Stencils


Thread/Cache Blocking

[Figure: the grid decomposed into thread blocks (4 in x, 2 in y, 2 in z), each further cache-blocked in y (2 cache blocks per thread block); z is unit-stride]

Performed exhaustive search over all possible power-of-two parameter values
Every thread block is the same size and shape
  Preserves load balancing
Did NOT cut in the contiguous dimension on x86 machines
  Avoids interrupting HW prefetchers
Only performed cache blocking in one dimension
  Sufficient to fit three read planes and one write plane into cache
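A sketch of one thread's sweep over its block with cache blocking in y only: the y range is processed in chunks of CB_Y so that, for the current chunk, the three read planes of A and the one write plane of B stay resident in cache. Bounds are the thread block [x0,x1) x [y0,y1) x [z0,z1); IDX, S0, S1 come from the naive sketch, and CB_Y is a hypothetical blocking factor chosen by the search.

#define CB_Y 64    /* illustrative cache-block size in y */

static void thread_block_sweep(const double *A, double *B,
                               int x0, int x1, int y0, int y1,
                               int z0, int z1, int ny, int nz)
{
    for (int jj = y0; jj < y1; jj += CB_Y)                 /* cache block in y    */
        for (int i = x0; i < x1; i++)                      /* plane of the block  */
            for (int j = jj; j < jj + CB_Y && j < y1; j++)
                for (int k = z0; k < z1; k++)              /* unit stride, uncut  */
                    B[IDX(i, j, k, ny, nz)] =
                        S0 *  A[IDX(i,     j,     k,     ny, nz)] +
                        S1 * (A[IDX(i - 1, j,     k,     ny, nz)] +
                              A[IDX(i + 1, j,     k,     ny, nz)] +
                              A[IDX(i,     j - 1, k,     ny, nz)] +
                              A[IDX(i,     j + 1, k,     ny, nz)] +
                              A[IDX(i,     j,     k - 1, ny, nz)] +
                              A[IDX(i,     j,     k + 1, ny, nz)]);
}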

Page 17: Tuning Stencils


Thread/Cache Blocking

Intel Clovertown | AMD Barcelona | Sun Victoria Falls

[Charts: GFlop/s (scale 0-9) versus concurrency (1, 2, 4, 8; up to 16 on Victoria Falls); series: Naive + NUMA-Aware + Loop Unrolling/Reordering + Padding + Thread/Cache Blocking]

Page 18: Tuning Stencils


Software Prefetching

Allows us to hide memory latency
Searched over varying prefetch distances and granularities (e.g. prefetch every register block, plane, or pencil)
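A sketch of prefetching inside one (i, j) pencil using the GCC/ICC __builtin_prefetch hint: the read stream of A and the write stream of B are both hinted PF_DIST elements ahead. PF_DIST is a hypothetical distance; the real search sweeps both the distance and the granularity. IDX, S0, S1 are from the naive sketch.

#define PF_DIST 64    /* illustrative prefetch distance, in doubles */

static void pencil_prefetch(const double *A, double *B,
                            int i, int j, int ny, int nz)
{
    for (int k = 1; k < nz - 1; k++) {
        /* hints only; prefetches that run past the pencil do not fault */
        __builtin_prefetch(&A[IDX(i, j, k + PF_DIST, ny, nz)], 0, 0);   /* read stream  */
        __builtin_prefetch(&B[IDX(i, j, k + PF_DIST, ny, nz)], 1, 0);   /* write stream */

        B[IDX(i, j, k, ny, nz)] =
            S0 *  A[IDX(i,     j,     k,     ny, nz)] +
            S1 * (A[IDX(i - 1, j,     k,     ny, nz)] +
                  A[IDX(i + 1, j,     k,     ny, nz)] +
                  A[IDX(i,     j - 1, k,     ny, nz)] +
                  A[IDX(i,     j + 1, k,     ny, nz)] +
                  A[IDX(i,     j,     k - 1, ny, nz)] +
                  A[IDX(i,     j,     k + 1, ny, nz)]);
    }
}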

Page 19: Tuning Stencils


Software Prefetching

Intel Clovertown | AMD Barcelona | Sun Victoria Falls

[Charts: GFlop/s (scale 0-9) versus concurrency (1, 2, 4, 8; up to 16 on Victoria Falls); series: Naive + NUMA-Aware + Loop Unrolling/Reordering + Padding + Thread/Cache Blocking + Prefetching]

Page 20: Tuning Stencils


SIMDization

Requires a complete code rewrite to utilize the 128-bit SSE registers
Allows a single instruction to add/multiply two doubles
Only possible on the x86 machines
Padding performed to achieve proper data alignment (not to avoid conflicts)
Searched over register block sizes and prefetch distances simultaneously
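A sketch of a SIMDized inner pencil using SSE2 intrinsics: two unit-stride points are produced per iteration with 128-bit adds and multiplies. This is illustrative only (the tuned code also register-blocks and prefetches); it reuses IDX, S0, S1 from the naive sketch, uses unaligned loads for the shifted neighbors, and assumes nz is even.

#include <emmintrin.h>    /* SSE2 intrinsics */

static void pencil_sse2(const double *A, double *B,
                        int i, int j, int ny, int nz)
{
    const __m128d s0 = _mm_set1_pd(S0);
    const __m128d s1 = _mm_set1_pd(S1);

    for (int k = 1; k < nz - 1; k += 2) {      /* points k and k+1 together */
        __m128d sum = _mm_add_pd(_mm_loadu_pd(&A[IDX(i - 1, j, k, ny, nz)]),
                                 _mm_loadu_pd(&A[IDX(i + 1, j, k, ny, nz)]));
        sum = _mm_add_pd(sum, _mm_loadu_pd(&A[IDX(i, j - 1, k,     ny, nz)]));
        sum = _mm_add_pd(sum, _mm_loadu_pd(&A[IDX(i, j + 1, k,     ny, nz)]));
        sum = _mm_add_pd(sum, _mm_loadu_pd(&A[IDX(i, j,     k - 1, ny, nz)]));   /* back neighbors  */
        sum = _mm_add_pd(sum, _mm_loadu_pd(&A[IDX(i, j,     k + 1, ny, nz)]));   /* front neighbors */

        __m128d res = _mm_add_pd(_mm_mul_pd(s0, _mm_loadu_pd(&A[IDX(i, j, k, ny, nz)])),
                                 _mm_mul_pd(s1, sum));
        _mm_storeu_pd(&B[IDX(i, j, k, ny, nz)], res);
    }
}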

Page 21: Tuning Stencils


SIMDization

Intel Clovertown | AMD Barcelona | Sun Victoria Falls

[Charts: GFlop/s (scale 0-9) versus concurrency (1, 2, 4, 8; up to 16 on Victoria Falls); series: Naive + NUMA-Aware + Loop Unrolling/Reordering + Padding + Thread/Cache Blocking + Prefetching + SIMDization]

Page 22: Tuning Stencils


Cache Bypass

Writes data directly to the write-back buffer
  No data load on a write miss
Changes the stencil kernel's flop:byte ratio from 1/3 to 1/2
  Reduces memory data traffic by 33%
Still requires the SIMDized code from the previous optimization
Searched over register block sizes and prefetch distances simultaneously
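A sketch of the cache-bypass store: in the SIMD pencil above, the final store is replaced by a non-temporal (streaming) store, so a write miss on B no longer pulls the cache line in before it is overwritten. Per point this cuts the traffic from 24 B to 16 B, which is the 1/3 to 1/2 flop:byte change noted above. The destination must be 16-byte aligned, which the alignment padding from the SIMDization step is assumed to provide.

#include <emmintrin.h>

/* Drop-in replacement for the aligned result store in the SIMD pencil. */
static inline void store_bypass(double *dst, __m128d val)
{
    _mm_stream_pd(dst, val);    /* non-temporal store: no write-allocate fill of B */
}

/* After the sweep, order the streaming stores before B is read elsewhere. */
static inline void bypass_fence(void)
{
    _mm_sfence();
}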

Page 23: Tuning Stencils


Cache Bypass

Intel Clovertown | AMD Barcelona | Sun Victoria Falls

[Charts: GFlop/s (scale 0-9) versus concurrency (1, 2, 4, 8; up to 16 on Victoria Falls); series: Naive + NUMA-Aware + Loop Unrolling/Reordering + Padding + Thread/Cache Blocking + Prefetching + SIMDization + Cache Bypass]

Page 24: Tuning Stencils


Collaborative Threading

[Figure: "No Collaboration" shows each thread t0-t7 owning its own thread block (4 in x, 2 in y, 2 in z, cache-blocked in y); "With Collaboration" all 8 threads t0-t7 share each block, organized as large collaborative thread blocks (4 in y, 2 in z) subdivided into small collaborative thread blocks (2 in y, 4 in z); z is unit-stride]

Requires another complete code rewrite
CT allows for better L1 cache utilization when switching threads
Only effective on VF due to:
  very small L1 cache (8 KB) shared by 8 HW threads
  lack of hardware prefetchers (allows us to cut in the contiguous dimension)
Drawback: parameter space becomes very large

Page 25: Tuning Stencils


Collaborative Threading

Intel Clovertown | AMD Barcelona | Sun Victoria Falls

[Charts: GFlop/s (scale 0-9) versus concurrency (1, 2, 4, 8; up to 16 on Victoria Falls); series: Naive + NUMA-Aware + Loop Unrolling/Reordering + Padding + Thread/Cache Blocking + Prefetching + SIMDization + Cache Bypass + Collaborative Threading]

Page 26: Tuning Stencils


Autotuning Results

Intel Clovertown | AMD Barcelona | Sun Victoria Falls

[Charts: GFlop/s (scale 0-9) versus concurrency (1, 2, 4, 8; up to 16 on Victoria Falls); series: Naive + NUMA-Aware + Loop Unrolling/Reordering + Padding + Thread/Cache Blocking + Prefetching + SIMDization + Cache Bypass + Collaborative Threading; annotations: 1.9x better, 5.4x better, 10.4x better than naive]

Page 27: Tuning Stencils


Architecture Comparison

[Charts: total GFlop/s for single and double precision, and power efficiency in MFlop/s/Watt (chip and system) across Clovertown, Barcelona, Victoria Falls, Cell, G80, and G80 including PCIe transfers]

Page 28: Tuning Stencils


Conclusions

Compilers alone fail to fully utilize system resources
  Programmers may not even know that the system is being underutilized
Autotuning provides a portable and effective solution
  Produces up to a 10.4x improvement over the compiler alone
To make autotuning tractable:
  Choose the order of optimizations appropriately for the platform
  Prune the search space intelligently for large searches
Power efficiency has become a valuable metric
  Local store-based architectures (e.g. Cell and G80) are usually more efficient than cache-based machines

Page 29: Tuning Stencils


Acknowledgements

Sam Williams for:
  writing the Cell stencil code
  guiding my work by autotuning SpMV and LBMHD
Vasily Volkov for writing the G80 CUDA code
Kathy Yelick and Jim Demmel for general advice and feedback