BERKELEY INSTITUTE FOR PERFORMANCE STUDIES
COMPUTATIONAL RESEARCH DIVISION
Performance Understanding, Prediction, and Tuning
at the Berkeley Institute for Performance
Studies (BIPS)
Katherine Yelick, BIPS Director
Lawrence Berkeley National Laboratory and U.C. Berkeley, EECS Dept.
National Science Foundation
Challenges to Performance
Two trends in High-End Computing:
• Increasingly complicated systems
  – Multiple forms of parallelism
  – Many levels of memory hierarchy
  – Complex systems software in between
• Increasingly sophisticated algorithms
  – Unstructured meshes and sparse matrices
  – Adaptivity in time and space
  – Multi-physics models lead to hybrid approaches
• Conclusion: Deep understanding of performance at all levels is important
BIPS Institute Goals
• Bring together researchers on all aspects of performance engineering
• Use performance understanding to:
  – Improve application performance
  – Compare architectures for application suitability
  – Influence the design of processors, networks, and compilers
  – Identify algorithmic needs
BIPS Approaches
• Benchmarking and Analysis
  – Measure performance
  – Identify opportunities for improvements in software, hardware, and algorithms
• Modeling
  – Predict performance on future machines
  – Understand performance limits
• Tuning
  – Improve performance, by hand or with automatic self-tuning tools
Multi-Level Analysis
• Full Applications
  – What users want
  – Do not reveal impact of features
• Compact Applications
  – Can be ported with modest effort
  – Easily match phases of full applications
• Microbenchmarks
  – Isolate architectural features
  – Hard to tie to real applications
[Figure: pyramid of Next-Gen Apps, Full Apps, Compact Apps, and Micro-Benchmarks, ordered by system size and complexity]
Projects Within BIPS
• Application evaluation on vector processors
• APEX: Application Performance Characterization Benchmarking
• BeBOP: Berkeley Benchmarking and Optimization Group
• Architectural probes for alternative architectures
• LAPACK: Linear Algebra Package
• PERC: Performance Engineering Research Center
• Top500
• ViVA: Virtual Vector Architectures
Application Evaluation of Vector Systems
Two vector architectures:
• The Japanese Earth Simulator
• The Cray X1
Comparison to "commodity"-based systems:
• IBM SP, Power4
• SGI Altix
Ongoing study of DOE applications:
  CACTUS    Astrophysics       100,000 lines   grid based
  PARATEC   Material Science    50,000 lines   Fourier space
  LBMHD     Plasma Physics       1,500 lines   grid based
  GTC       Magnetic Fusion      5,000 lines   particle based
  MADCAP    Cosmology            5,000 lines   dense lin. alg.
Work by L. Oliker, J. Borrill, A. Canning, J. Carter, J. Shalf, H. Shan
APEX-MAP Benchmark
• Goal: Quantify the effects of temporal and spatial locality
• Focus on memory system and network performance
[Figure: APEX-MAP surface for sequential runs on the Altix (Itanium 2): cycles per access plotted over spatial locality (1 to 65536) and temporal locality (0.01 to 1000, log scales)]
• Graphs over temporal and spatial locality axes
• Show performance valleys/cliffs
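A minimal sketch of the kind of parameterized access stream APEX-MAP measures. The function name, the power-law form of the temporal-locality knob, and all constants here are illustrative assumptions, not the benchmark's actual definition:

```python
import random

def apex_map_stream(n, stanza, alpha, count, seed=0):
    """Generate a synthetic index stream in the spirit of APEX-MAP.

    n      -- size of the data set being walked
    stanza -- spatial locality: length of each contiguous run
    alpha  -- temporal locality: skew of the starting-address
              distribution (near 1.0 -> uniform; smaller -> more re-use
              of low, "recently used" indices)
    count  -- number of stanzas to emit
    """
    rng = random.Random(seed)
    stream = []
    for _ in range(count):
        # Power-law-skewed starting address favors low indices.
        start = int(n * (rng.random() ** (1.0 / alpha))) % n
        stream.extend((start + i) % n for i in range(stanza))
    return stream

s = apex_map_stream(n=1 << 16, stanza=4, alpha=0.25, count=8)
```

Sweeping `stanza` and `alpha` over log scales yields exactly the kind of two-axis surface shown on this slide.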
Application Kernel Benchmarks
• Microbenchmarks are good for:
  – Identifying architecture/compiler bottlenecks
  – Optimization opportunities
• Application benchmarks are good for:
  – Machine selection for specific apps
• In between: benchmarks that capture important behavior in real applications
  – Sparse matrices: SpMV benchmark
  – Stencil operations: stencil probe
  – Possible future: sorting, narrow datatype ops, …
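For the stencil side, a toy version of what a stencil probe exercises: a sweep with fixed neighbor offsets and no indirection, so performance is governed by streaming and cache reuse rather than random access. This 5-point averaging kernel is an illustrative stand-in, not the actual probe:

```python
def stencil_probe(grid, n):
    """Minimal 5-point stencil sweep: each interior point becomes the
    average of itself and its four neighbors, read from the old grid."""
    out = [row[:] for row in grid]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            out[i][j] = 0.2 * (grid[i][j] + grid[i - 1][j] + grid[i + 1][j]
                               + grid[i][j - 1] + grid[i][j + 1])
    return out

g = [[0.0] * 4 for _ in range(4)]
g[1][1] = 5.0
h = stencil_probe(g, 4)
```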
Sparse Matrix Vector Multiply (SpMV)
• Sparse matrix algorithms
  – Increasingly important in applications
  – Challenge memory systems: poor locality
  – Many matrices have structure (e.g., dense sub-blocks) that can be exploited
• Benchmarking SpMV
  – NAS CG and SciMark use a random matrix
  – Not reflective of most real problems
• Benchmark challenge:
  – Ship real matrices: cumbersome and inflexible
  – Build "realistic" synthetic matrices
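As a concrete reference for the kernel being benchmarked, a plain compressed-sparse-row (CSR) matrix-vector multiply; this is a sketch in Python, whereas the benchmark's real kernel is register-blocked C:

```python
def spmv_csr(rowptr, colidx, vals, x):
    """y = A*x for A in compressed sparse row (CSR) format.

    The indirect load x[colidx[k]] in the inner loop is what stresses
    the memory system on matrices with poor locality.
    """
    n = len(rowptr) - 1
    y = [0.0] * n
    for i in range(n):
        acc = 0.0
        for k in range(rowptr[i], rowptr[i + 1]):
            acc += vals[k] * x[colidx[k]]
        y[i] = acc
    return y

# 2x2 example: A = [[2, 0], [1, 3]]
rowptr, colidx, vals = [0, 1, 3], [0, 0, 1], [2.0, 1.0, 3.0]
y = spmv_csr(rowptr, colidx, vals, [1.0, 1.0])  # -> [2.0, 4.0]
```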
Importance of Using Blocked Matrices
[Chart: SpMV speedup from BCSR, best-case blocked matrix vs. unblocked (1x1), for the medium best-case and medium FEM matrices, on Athlon 1.145 GHz, Opteron 1.4 GHz, Itanium 2 1.3 GHz, Itanium 2 1.5 GHz, Apple G5 1.8 GHz, and Pentium 4 2.4 GHz; y-axis 0 to 9x]
Generating Blocked Matrices
• Our approach: uniformly distributed random structure, each nonzero an r x c block
  – Collect data for r and c from 1 to 12
• Validation: can our random matrices simulate "typical" matrices?
  – 44 matrices from various applications
  – 1: dense matrix in sparse format
  – 2–17: Finite Element Method (FEM) matrices
    – 2–9: single block size; 10–17: multiple block sizes
  – 18–44: non-FEM
• Summarization: weighted by occurrence in test suite (ongoing)
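The construction above can be sketched roughly as follows. The function name and the dict-of-entries representation are illustrative conveniences; a real generator would emit BCSR arrays directly:

```python
import random

def random_blocked_matrix(nrows, ncols, r, c, nblocks, seed=0):
    """Sparse matrix with uniformly scattered dense r x c blocks, in the
    spirit of the synthetic SpMV benchmark matrices.  Returns a dict
    {(i, j): value} for clarity rather than speed."""
    rng = random.Random(seed)
    entries = {}
    brows, bcols = nrows // r, ncols // c
    placed = set()
    while len(placed) < nblocks:
        bi, bj = rng.randrange(brows), rng.randrange(bcols)
        if (bi, bj) in placed:
            continue  # keep the chosen block positions distinct
        placed.add((bi, bj))
        for di in range(r):
            for dj in range(c):
                entries[(bi * r + di, bj * c + dj)] = rng.random()
    return entries

A = random_blocked_matrix(12, 12, r=3, c=2, nblocks=4)
```

Sweeping r and c from 1 to 12 and matching the fill of a target matrix gives the validation data described above.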
Itanium 2 prediction
[Chart: Mflop/s, SpMV benchmark vs. test matrices, on Itanium 2: the benchmark is run at the data size matching each test matrix, and test-matrix rates are scaled for fill ratio; values range from 0 to 1400 Mflop/s across matrices 1–44]
UltraSparc III prediction
[Chart: Mflop/s, SpMV benchmark vs. test matrices, on the Sun UltraSparc III, over test matrices 1–44; values range from 0 to 70 Mflop/s]
Sample summary results (Apple G5, 1.8 GHz)
Sparse matrix-vector multiplication (SpMV) benchmark results:

  Matrix type     Matrix size   Mflop/s    (r,c)
  Best case       small         2033.12    (12,6)
  FEM (blocked)   small         1073.04
  Non-blocked     small          299.536   (1,1)
  (small matrix is 2779 x 2779 with 0.0022282 fill)

  Best case       medium         541.875   (12,8)
  FEM (blocked)   medium         443.346
  Non-blocked     medium         193.14    (1,1)
  (medium matrix is 12154 x 12154 with 0.00222844 fill)
Selected SpMV benchmark results
1. Raw results
   ● Which machine is fastest
2. Scaled by the machine's peak floating-point rate
   ● Mitigates chip technology factors
   ● Influenced by compiler issues
3. Fraction of peak memory bandwidth
   ● Use the Stream benchmark for the "attainable peak"
   ● How close to this bound is SpMV running?
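The three scalings can be sketched as below. The 6-bytes-per-flop traffic estimate for CSR (an 8-byte value plus a 4-byte column index per 2 flops, ignoring vector and row-pointer traffic) and the sample machine numbers are illustrative assumptions:

```python
def spmv_report(mflops, peak_mflops, stream_mbs):
    """The three views of an SpMV result: raw rate is `mflops` itself;
    this returns the fraction of peak flop rate and the fraction of
    attainable (Stream) memory bandwidth the kernel sustains."""
    frac_peak = mflops / peak_mflops
    bw_demand_mbs = mflops * 6.0        # MB/s needed at ~6 bytes/flop
    frac_stream = bw_demand_mbs / stream_mbs
    return frac_peak, frac_stream

# Hypothetical machine: 500 Mflop/s SpMV, 6000 Mflop/s peak, 4000 MB/s Stream
frac_peak, frac_stream = spmv_report(500.0, 6000.0, 4000.0)
```

The gap between the two fractions is the point of view 3: a kernel at a few percent of peak flops can still be running close to the memory-bandwidth bound.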
[Chart: raw SpMV performance for the medium problem size (best case / FEM / 1x1), 0 to 1000 Mflop/s, on Athlon 1.145 GHz, Opteron 1.4 GHz, Itanium 2 1.5 GHz, Apple G5 1.8 GHz, and Pentium 4 2.4 GHz]
[Chart: SpMV performance for the medium problem size, scaled by peak floating-point rate (fraction of peak, 0 to 0.18; best case / FEM / 1x1), on Athlon 1.145 GHz, Opteron 1.4 GHz, Itanium 2 1.5 GHz, Apple G5 1.8 GHz, and Pentium 4 2.4 GHz]
[Chart: SpMV (medium) memory bandwidth scaled by peak memory bandwidth (fraction of peak, 0 to 0.7; best case and 1x1), on Athlon 1.145 GHz, Opteron 1.4 GHz, Itanium 2 1.5 GHz, Apple G5 1.8 GHz, and Pentium 4 2.4 GHz]
Automatic Performance Tuning
• Performance depends on machine, kernel, and matrix
  – Matrix known only at run time
  – Best data structure and implementation can be surprising
• Filling in explicit zeros can
  – Reduce storage
  – Improve performance
  – Pentium III example: 50% more nonzeros, 50% faster
• BeBOP approach: empirical modeling and search
  – Up to 4x speedups and 31% of peak for SpMV
  – Many optimization techniques for SpMV
  – Several other kernels: triangular solve, A^T*A*x, A^k*x
  – Proof of concept: integrate with Omega3P
  – Release the OSKI library, integrate into PETSc
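Why filling in explicit zeros can pay off: padding the sparsity pattern to aligned dense blocks trades extra stored zeros for a dense, unrollable inner loop. A sketch of the bookkeeping for 2x2 register blocking; the fill-ratio accounting is illustrative, and a real autotuner weighs it against the measured per-block speed of each machine:

```python
def fill_ratio_2x2(entries):
    """Pad a sparsity pattern to full, aligned 2x2 blocks and report
    the fill ratio: entries stored after padding / true nonzeros.
    Register blocking wins when the dense inner loop's speedup
    outweighs the extra flops and storage this ratio represents."""
    blocks = {(i // 2, j // 2) for (i, j) in entries}
    return 4 * len(blocks) / len(entries)

# A pure diagonal blocks badly: each 2x2 block holds only 2 of 4 slots.
r_diag = fill_ratio_2x2({(i, i) for i in range(8)})
# Aligned dense 4x4 structure pads to nothing: ratio 1.0.
r_full = fill_ratio_2x2({(i, j) for i in range(4) for j in range(4)})
```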
Summary of Optimizations
• Optimizations for SpMV (numbers shown are maximums)
  – Register blocking (RB): up to 4x
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x
  – Reordering to create dense structure + splitting: 2x
  – Symmetry: 2.8x
  – Cache blocking: 6x
  – Multiple vectors (SpMM): 7x
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x
• Higher-level kernels
  – A*A^T*x, A^T*A*x: 4x
  – A^2*x: 2x over CSR, 1.5x
• Future: automatic tuning for vectors
Architectural Probes
• Understanding memory system performance
• Interaction with processor architecture:
  – Number of registers
  – Arithmetic units (parallelism)
  – Prefetching
  – Cache size, structure, policies
• APEX-MAP: memory and network system
• Sqmat: processor features included
Impact of Indirection
• Results from the Sqmat "probe"
• Unit-stride access via indirection (S=1)
• Opteron and Power3/4 show less than a 10% penalty once M>8, demonstrating that the bandwidth between cache and processor effectively delivers both addresses and values
• Itanium 2 shows a high penalty for indirection
[Chart: slowdown (1x to 6x) vs. computational intensity M (1 to 512) for Itanium 2, Opteron, Power3, and Power4]
Tolerating Irregularity
• S50 (penalty for random access)
  – S is the length of each unit-stride run
  – Start with S= (indirect unit stride)
  – How large must S be to achieve at least 50% of this performance?
  – All done for a fixed computational intensity
• CI50 (hide the random-access penalty using high computational intensity)
  – CI is computational intensity, controlled by the number of squarings (M) per matrix
  – Start with M=1, S=
  – At S=1 (every access random), how large must M be to achieve 50% of this performance?
• For both, lower numbers are better
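A sketch of how such a threshold metric can be computed from any timing routine. The doubling search and the toy cost model (miss latency, streaming cost, flop cost per element) are illustrative assumptions, not the actual Sqmat methodology:

```python
def s50(time_fn, ci, target=0.5):
    """Smallest stanza length S (found by doubling) at which throughput
    reaches `target` of the long-stanza, effectively unit-stride limit,
    at fixed computational intensity `ci`.  `time_fn(S, M)` returns
    time per element-operation."""
    base = 1.0 / time_fn(1 << 20, ci)      # ~unit-stride throughput
    s = 1
    while 1.0 / time_fn(s, ci) < target * base:
        s *= 2
    return s

def toy_time(s, m):
    # Hypothetical cost model: each stanza pays one miss, then streams
    # s elements, each processed m times.
    miss, stream, flop = 100.0, 1.0, 0.5
    return (miss + s * stream + s * m * flop) / (s * m)

S = s50(toy_time, ci=1)   # -> 128 under this toy model
```

CI50 is the same search with the roles of S and M swapped: fix S=1 and double M until the target fraction is reached.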
Tolerating Irregularity
• Gather/scatter is expensive on commodity cache-based systems
  – Power4 tolerates only 1.6% random access (1 in 64)
  – Itanium 2 is much less sensitive, at 25% (1 in 4)
• A huge amount of computation may be required to hide the overhead of irregular data access
  – Itanium 2 requires a CI of about 9 flops/word
  – Power4 requires a CI of almost 75!
S50: what percentage of memory accesses can be random before performance decreases by half? CI50: how much computational intensity is required to hide the penalty of all-random access?
[Charts: (left) percentage of indirection tolerated before performance drops by half, log scale 0%–100%, with values 25%, 6.3%, 1.6%, and 0.8% across Itanium 2, Opteron, Power3, and Power4; (right) computational intensity (CI) required to hide indirection: Itanium 2 9.3, Power3 18.7, Power4 74.7, Opteron 149.3]
Memory System Observations
• Caches are important
• The important gap has moved: between L3 and memory, not L1 and L2
• Prefetching is increasingly important
  – Limited and finicky
  – Its effect may overwhelm cache optimizations if blocking increases non-unit-stride access
• Sparse codes: matrix volume is the key factor, not the indirect loads
Ongoing Vector Investigation
• How much hardware support is needed for vector-like performance?
  – Can small changes to a conventional processor get this effect?
  – Role of compilers/software
  – Related to the Power5 effort
• Latency hiding in software
  – Prefetch engines are easily confused
  – Sparse matrix (random) and grid-based (strided) applications are the target
• Currently investigating simulator tools and any emerging hardware
Summary
• High-level goals:
  – Understand future HPC architecture options that are commercially viable
  – Can minimal hardware extensions improve effectiveness for scientific applications?
• Various technologies
  – Current, future, academic
• Various performance analysis techniques
  – Application-level benchmarks
  – Application kernel benchmarks (SpMV, stencil)
  – Architectural probes
  – Performance modeling and prediction
People within BIPS
• Jonathan Carter
• Kaushik Datta
• James Demmel
• Joe Gebis
• Paul Hargrove
• Parry Husbands
• Shoaib Kamil
• Bill Kramer
• Rajesh Nishtala
• Leonid Oliker
• John Shalf
• Hongzhang Shan
• Horst Simon
• David Skinner
• Erich Strohmaier
• Rich Vuduc
• Mike Welcome
• Sam Williams
• Katherine Yelick
And many collaborators outside Berkeley Lab/Campus
End of Slides
Sqmat overview
• A Java code generator produces unrolled C code
• Stream of matrices
  – Square each matrix M times
  – M controls computational intensity (CI): the ratio between flops and memory accesses
  – Each matrix is size N x N
  – N controls working-set size: 2N^2 registers are required per matrix; N is varied to cover the observable register-set size
• Two storage formats:
  – Direct storage: Sqmat's matrix entries are stored contiguously in memory
  – Indirect: entries are accessed through an indirection vector; "stanza length" S controls the degree of indirection
[Figure: indirection vector pointing at S contiguous N x N matrices in a row]
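A minimal functional sketch of the probe's computation. This is a Python stand-in for the generated C; the function name, the list-of-lists layout, and the optional index argument standing in for the indirection vector are all illustrative:

```python
def sqmat(mats, m, index=None):
    """Square each N x N matrix in `mats` M times, as in the Sqmat
    probe.  `index` optionally supplies an indirection vector over the
    matrix slots, mimicking Sqmat's indirect storage format."""
    order = index if index is not None else range(len(mats))
    out = []
    for slot in order:
        a = mats[slot]
        n = len(a)
        for _ in range(m):
            # One squaring: a = a @ a, written out explicitly.
            a = [[sum(a[i][k] * a[k][j] for k in range(n))
                  for j in range(n)] for i in range(n)]
        out.append(a)
    return out

# One 2x2 matrix squared twice: [[1,1],[0,1]] -> [[1,2],[0,1]] -> [[1,4],[0,1]]
r = sqmat([[[1, 1], [0, 1]]], m=2)
```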
Potential Impact on Applications: T3P
• Source: SLAC [Ko]
• 80% of time spent in SpMV
• Relevant optimization techniques
  – Symmetric storage
  – Register blocking
• On a single-processor Itanium 2
  – 1.68x speedup: 532 Mflop/s, or 15% of 3.6 Gflop/s peak
  – 4.4x speedup with 8 multiple vectors: 1380 Mflop/s, or 38% of peak
Potential Impact on Applications: Omega3P
• Application: accelerator cavity design [Ko]
• Relevant optimization techniques
  – Symmetric storage
  – Register blocking
  – Reordering
    – Reverse Cuthill-McKee ordering to reduce bandwidth
    – Traveling Salesman Problem-based ordering to create blocks: nodes = columns of A; weight(u, v) = number of nonzeros u and v have in common; a tour is an ordering of the columns; choose the maximum-weight tour (see [Pinar & Heath '97])
• 2x speedup on Itanium 2, but SpMV is not dominant
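The TSP ordering can be approximated greedily, as sketched below: a nearest-neighbor stand-in for the maximum-weight tour of [Pinar & Heath '97], with the function name and the column-to-row-set representation chosen for illustration:

```python
def greedy_block_order(cols):
    """Greedy stand-in for the TSP-based column ordering: repeatedly
    append the unvisited column sharing the most nonzero rows with the
    current one, so similar columns become adjacent and dense blocks
    can form.  `cols` maps column index -> set of nonzero row indices."""
    cols = {j: set(rows) for j, rows in cols.items()}
    tour = [min(cols)]                    # arbitrary starting column
    left = set(cols) - {tour[0]}
    while left:
        cur = cols[tour[-1]]
        # Max shared-nonzero weight; break ties toward the lower index.
        nxt = max(left, key=lambda j: (len(cur & cols[j]), -j))
        tour.append(nxt)
        left.remove(nxt)
    return tour

# Columns 0 and 2 share rows {1, 2}; column 1 shares nothing with 0.
tour = greedy_block_order({0: {1, 2}, 1: {5}, 2: {1, 2, 3}})  # -> [0, 2, 1]
```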
Emerging Architectures
• General-purpose processors are badly suited for data-intensive ops
  – Large caches are not useful if re-use is low
  – Low memory bandwidth, especially for irregular patterns
  – Superscalar methods of increasing ILP are inefficient
  – Power consumption
• Research architectures
  – Berkeley IRAM: vector and PIM chip
  – Stanford Imagine: stream processor
  – ISI DIVA: PIM with conventional processor
Sqmat on PIM Systems
• Performance of Sqmat on PIMs and others for 3x3 matrices, squared 10 times (high computational intensity!)
• Imagine much faster for long streams, slower for short ones
[Chart: Mflop/s with varying stream lengths (8 to 1024) on IMAGINE, IRAM, DIVA, and Power3; values range from 0 to 3500 Mflop/s]
Comparison to HPCC “Four Corners”
[Figure: HPCC "four corners" mapped onto temporal vs. spatial locality, with the Sqmat settings that emulate each corner: RandomAccess ~ Sqmat S=1, M=1, N=1; Stream ~ Sqmat S=0, M=1, N=1; LINPACK ~ Sqmat S=0, M=8, N=8; FFT (future)]

Opteron:
  LINPACK 2000 MFLOPS @ 1.4 GHz vs. Sqmat 2145 MFLOPS @ 1.6 GHz
  STREAMS 1969 MB/s vs. Sqmat 2047 MB/s
  RandomAccess 0.00442 GUPs vs. Sqmat 0.00440 GUPs

Itanium 2:
  LINPACK 4.65 GFLOPs vs. Sqmat 4.47 GFLOPs
  STREAMS 3895 MB/s vs. Sqmat 4055 MB/s
  RandomAccess 0.00484 GUPs vs. Sqmat 0.0141 GUPs