1
High-Performance Grid Computing and Research Networking
Presented by Juan Carlos Martinez
Instructor: S. Masoud Sadjadi
http://www.cs.fiu.edu/~sadjadi/Teaching/
sadjadi At cs Dot fiu Dot edu
High-Performance Sequential Programming
2
Acknowledgements
The content of many of the slides in these lecture notes has been adapted from the online resources prepared previously by the people listed below. Many thanks!
Henri Casanova Principles of High Performance Computing http://navet.ics.hawaii.edu/~casanova [email protected]
3
Sequential Programs
In this class we're mostly focusing on concurrent programs, but it's useful to recall some simple notions of high performance for sequential programs:
  Because some fundamental techniques are meaningful for concurrent programs
  Because in your projects you'll have to get code to go fast, and a concurrent program is just simultaneous sequential programs
We'll look at:
  Standard code optimization techniques
  Optimizations dealing with memory issues
4
Loop Constants
Identifying loop constants:

  for (k=0;k<N;k++) {
    c[i][j] += a[i][k] * b[k][j];
  }

Here c[i][j] does not depend on k, so it can be kept in a scalar:

  sum = 0;
  for (k=0;k<N;k++) {
    sum += a[i][k] * b[k][j];
  }
  c[i][j] = sum;
5
Multi-dimensional Array Accesses
A static 2-D array is one declared as <type> <name>[<size>][<size>], e.g.:
  int myarray[10][30];
The elements of a 2-D array are stored in contiguous memory cells
The problem is that:
  The array is 2-D, conceptually
  Computer memory is 1-D
1-D computer memory: a memory location is described by a single number, its address (just like a single axis)
Therefore, there must be a mapping from 2-D to 1-D: from a 2-D abstraction to a 1-D implementation
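For illustration, here is a hand-rolled version of such a mapping (not from the slides; the struct and helper names are made up). It backs a conceptual 2-D matrix with a single 1-D block of memory and computes the 1-D offset explicitly:

  #include <stdlib.h>

  /* Hypothetical by-hand 2-D array: a rows x cols matrix stored in one 1-D block. */
  typedef struct { int rows, cols; double *data; } Matrix2D;

  /* Row-major mapping: element (i,j) lives at offset i*cols + j. */
  double get(const Matrix2D *m, int i, int j)      { return m->data[i * m->cols + j]; }
  void   set(Matrix2D *m, int i, int j, double v)  { m->data[i * m->cols + j] = v; }

  Matrix2D make_matrix(int rows, int cols) {
      Matrix2D m = { rows, cols, calloc((size_t)rows * cols, sizeof(double)) };
      return m;
  }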
6
Mapping from 2-D to 1-D?
[Figure: an n x n 2-D array laid out in 1-D computer memory; one possible 2-D to 1-D mapping, and another one]
There are (n²)! possible mappings
7
Row-Major, Column-Major
Luckily, only 2 of the (n²)! mappings are ever implemented in a language:
  Row-Major: rows are stored contiguously (1st row, 2nd row, 3rd row, 4th row, ...)
  Column-Major: columns are stored contiguously (1st column, 2nd column, 3rd column, 4th column, ...)
8
Row-Major
C uses Row-Major
[Figure: as addresses increase, the rows of the matrix are stored one after another across consecutive memory/cache lines; matrix elements are stored in contiguous memory lines]
9
Column-Major
FORTRAN uses Column-Major
[Figure: as addresses increase, the columns of the matrix are stored one after another across consecutive memory/cache lines; matrix elements are stored in contiguous memory lines]
10
Address Computation
For an M x N row-major array:
  @(a[i][j]) = @(a[0][0]) + i*N + j
  (Detail: there should be a sizeof() factor as well)
Example with N = 6:
  @(a[2][3]) = @(a[0][0]) + 2*6 + 3 = @(a[0][0]) + 15
For column-major (like in FORTRAN), the formula is reversed:
  @(a[i][j]) = @(a[0][0]) + j*M + i
or, with 1-based indexing:
  @(a[i][j]) = @(a[1][1]) + (j-1)*M + (i-1)
[Figure: an M x N row-major array; element a[i][j] sits i*N + j elements past @(a[0][0])]
11
Array Accesses are Expensive
Given that the formula is @(a[i][j]) = @(a[0][0]) + i*N + j, each array access entails 2 additions and 1 multiplication
This is even higher for higher-dimensional arrays
Therefore, when the compiler compiles the instruction
  sum += a[i][k] * b[k][j];
4 integer additions and 2 integer multiplications are generated just to compute addresses!
And then 1 fp multiplication and 1 fp addition
If the bottleneck is memory, then we don't care
But if the processor is not starved for data (which we will see is possible for this application), then the overhead of computing addresses is large
12
Removing Array Accesses
Replace array accesses by pointer dereferences:

  for (j=0;j<N;j++)
    a[i][j] = 2;             // 2*N adds, N multiplies

  double *ptr = &(a[i][0]);  // 2 adds, 1 multiply
  for (j=0;j<N;j++) {
    *ptr = 2;
    ptr++;                   // N integer additions
  }
13
Loop Unrolling
Loop unrolling:

  for (i=0;i<100;i++)   // 100 comparisons
    a[i] = i;

  i=0;
  do {
    a[i] = i; i++;
    a[i] = i; i++;
    a[i] = i; i++;
    a[i] = i; i++;
  } while (i<100);      // 25 comparisons
14
Loop Unrolling
One can unroll a loop by more (or less) than 4-fold
If the unrolling factor does not divide the number of iterations, then one must handle the few leftover iterations separately, e.g., before the loop (see the sketch below)
Trade-off: performance gain vs. code size
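A sketch of a 4-fold unrolling with a cleanup loop for the leftover iterations (not from the slides; the fill function is just an example):

  /* Unroll by 4 when the trip count n need not be a multiple of 4.
     The first n % 4 iterations are peeled off before the unrolled loop,
     as suggested above (doing them after the loop works equally well). */
  void fill(int *a, int n) {
      int i = 0;
      for (; i < n % 4; i++)       /* leftover iterations */
          a[i] = i;
      for (; i < n; i += 4) {      /* main body: remaining count is divisible by 4 */
          a[i]   = i;
          a[i+1] = i + 1;
          a[i+2] = i + 2;
          a[i+3] = i + 3;
      }
  }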
15
Code Motion

  sum = 0;
  for (i = 0; i <= fact(n); i++)
    sum += i;

  sum = 0;
  f = fact(n);
  for (i = 0; i <= f; i++)
    sum += i;
16
Inlining
Inlining:

  for (i=0;i<N;i++)
    sum += cube(i);
  ...
  int cube(int i) { return (i*i*i); }

becomes

  for (i=0;i<N;i++)
    sum += i*i*i;
17
Common Sub-expression
Common sub-expression elimination:

  x = a + b - c;
  y = a + d + e + b;

  tmp = a + b;
  x = tmp - c;
  y = tmp + d + e;
18
Dead Code
Dead code elimination:

  x = 12;
  ...
  x = a+c;

becomes

  ...
  x = a+c;

Seems obvious, but may be "hidden":

  int x = 0;
  ...
  #ifdef FOO
    x = f(3);
  #else
19
Other Techniques
Strength reduction:
  a = i*3;     becomes     a = i+i+i;
Constant propagation:
  int speedup = 3;
  efficiency = 100 * speedup / numprocs;
  x = efficiency * 2;
becomes
  x = 600 / numprocs;
20
So where are we?
We have seen a few optimization techniques, but there are many others!
We could apply them all to the code, but this would result in completely unreadable/undebuggable code
Fortunately, the compiler should come to the rescue
  To some extent, at least
  Some compilers can do a lot for you, some not so much
  Typically, compilers provided by a vendor can do pretty tricky optimizations
21
What do compilers do?
All modern compilers perform some automatic optimization when generating code
  In fact, you implement some of those in a graduate-level compiler class, and sometimes at the undergraduate level
Most compilers provide several levels of optimization
  -O0: no optimization (in fact some is always done)
  -O1, -O2, ..., -OX
The higher the optimization level, the higher the probability that a debugger will have trouble dealing with the code
  Always debug with -O0
  Some compilers enforce that -g means -O0
Some compilers will flat out tell you that higher levels of optimization may break some code!
22
Compiler optimizations
In this class we use gcc, which is free and pretty good
  -Os: optimize for size (some optimizations increase code size tremendously)
  Do a "man gcc" and look at the many optimization options: one can pick and choose, or just use standard sets via -O1, -O2, etc.
The fanciest compilers are typically the ones provided by vendors
  You can't sell a good machine if it has a bad compiler
Compiler technology used to be really poor; languages also used to be designed without thinking of compilers (FORTRAN, Ada)
  No longer true: every language designer today has an in-depth understanding of compiler technology
23
What can compilers do?
Most of the techniques we've seen!
  Inlining
  Assignment of variables to registers (a difficult problem)
  Dead code elimination
  Algebraic simplification
  Moving invariant code out of loops
  Constant propagation
  Control flow simplification
  Instruction scheduling, reordering
  Strength reduction (e.g., add to pointers, rather than doing array index computation)
  Loop unrolling and software pipelining
  Dead store elimination
  and many others...
25
Instruction scheduling
Modern computers have multiple functional units that could be used in parallel
  Or at least ones that are pipelined: if fed operands at each cycle, they can produce a result at each cycle, although an individual computation may require 20 cycles
Instruction scheduling:
  Reorder the instructions of a program (e.g., at the assembly code level)
  Preserve correctness
  Make it possible to use functional units optimally
26
Instruction Scheduling
One cannot just shuffle all instructions around: preserving correctness means that data dependences are unchanged
Three types of data dependences (see the small example below):
  True dependence:    a = ...   followed by   ... = a
  Output dependence:  a = ...   followed by   a = ...
  Anti dependence:    ... = a   followed by   a = ...
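A small illustrative C snippet (an assumed example, not from the slides) showing all three dependence kinds on one variable:

  /* Illustrative only: the three kinds of data dependences on variable a. */
  int x, y, a;

  void deps(void) {
      a = x + 1;   /* write a                                                  */
      y = a * 2;   /* read a  -> true (flow) dependence on the write above     */
      a = y - 3;   /* write a -> anti dependence on the read above, and
                      output dependence on the first write                     */
  }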
27
Instruction Scheduling Example

  Before scheduling         After scheduling
  ADD  R1,R2,R4             ADD  R1,R2,R4
  ADD  R2,R2,1              LOAD R4,@2
  ADD  R3,R6,R2             ADD  R2,R2,1
  LOAD R4,@2                ADD  R3,R6,R2

Since loading from memory can take many cycles, one may as well do it as early as possible
The LOAD can't be moved any earlier because of the anti-dependence on R4 with the first ADD
28
Software Pipelining
Fancy name for "instruction scheduling for loops"
Can be done by a good compiler:
  First unroll the loop
  Then make sure that instructions can happen in parallel, i.e., "schedule" them on functional units
Let's see a simple example
29
Example
Source code:
  for (i=0;i<n;i++)
    sum += a[i];

Loop body in assembly:

  r1 = L r0
  ---            ; stall
  r2 = Add r2,r1
  r0 = Add r0,4

Unroll the loop and allocate registers (may be very difficult):

  r1 = L r0
  ---            ; stall
  r2 = Add r2,r1
  r0 = Add r0,12
  r4 = L r3
  ---            ; stall
  r2 = Add r2,r4
  r3 = Add r3,12
  r7 = L r6
  ---            ; stall
  r2 = Add r2,r7
  r6 = Add r6,12
  r10 = L r9
  ---            ; stall
  r2 = Add r2,r10
  r9 = Add r9,12
30
Example (cont.)
[Figure: the unrolled instructions scheduled to exploit instruction-level parallelism where possible; loads, adds, and pointer updates from different iterations are interleaved so that independent operations can issue together]
Schedule the unrolled instructions, exploiting instruction-level parallelism if possible
Identify the repeating pattern (the kernel)
31
Example (cont.)
The loop becomes a prologue (filling the pipeline), a kernel (the repeating scheduled pattern, executed many times), and an epilogue (draining the pipeline):
[Figure: the scheduled instruction stream with the prologue, kernel, and epilogue regions marked]
32
Software Pipelining
The "kernel" may require many registers, and it's nice to know how to use as few as possible
  Otherwise, one may have to go to cache more, which may negate the benefits of software pipelining
Dependency constraints must be respected
  May be very difficult to analyze for complex nested loops
Software pipelining with registers is a very well-known NP-hard problem
33
Limits to Compiler Optimization
Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles
  e.g., data ranges may be more limited than variable types suggest
  e.g., using an "int" in C for what could be an enumerated type
Most analysis is performed only within procedures
  Whole-program analysis is too expensive in most cases
Most analysis is based only on static information
  The compiler has difficulty anticipating run-time inputs
When in doubt, the compiler must be conservative
  It cannot perform an optimization if it changes program behavior under any realizable circumstance, even if circumstances seem quite bizarre and unlikely
34
Good practice
Writing code for high performance means working hand-in-hand with the compiler
#1: Optimize things that we know the compiler cannot deal with
  For instance, the "blocking" optimization for matrix multiplication may need to be done by hand
  But some compilers may find the best i-j-k ordering!
#2: Write code so that the compiler can do its optimizations
  Remove optimization blockers
35
Optimization blocker: aliasing
Aliasing: two pointers point to the same location
If a compiler can't tell what a pointer points at, it must assume it can point at almost anything
Example:

  void foo(int *q, int *p) {
    *q = 3;
    (*p)++;
    *q *= 4;
  }

cannot be safely optimized to:

  (*p)++;
  *q = 12;

because perhaps p == q
Some compilers have pretty fancy aliasing analysis capabilities
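One C99 tool for removing this blocker is the restrict qualifier; a sketch (assuming the programmer can actually guarantee that the pointers never alias):

  /* With restrict, the programmer promises that p and q never point to the
     same object, so the compiler is free to fold the two writes through q
     into a single *q = 12. */
  void foo(int * restrict q, int * restrict p) {
      *q = 3;
      (*p)++;
      *q *= 4;
  }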
37
Blocker: Function Call

  sum = 0;
  for (i = 0; i <= fact(n); i++)
    sum += i;

A compiler cannot optimize this because the function fact may have side effects
  e.g., it may modify global variables
The function may not return the same value for given arguments
  It may depend on other parts of the global state, which may be modified in the loop
Why doesn't the compiler look at the code for fact?
  The linker may overload it with a different version (unless it is declared static)
  Interprocedural optimization is not used extensively due to cost
  Inlining can achieve the same effect for small procedures
Again: the compiler treats a procedure call as a black box, which weakens optimizations in and around it
38
Other Techniques
Use more local variables:

  while( … ) {
    *res++ = filter[0]*signal[0]
           + filter[1]*signal[1]
           + filter[2]*signal[2];
    signal++;
  }

  register float f0 = filter[0];
  register float f1 = filter[1];
  register float f2 = filter[2];
  while( … ) {
    *res++ = f0*signal[0] + f1*signal[1] + f2*signal[2];
    signal++;
  }

Helps some compilers
39
Other Techniques
Replace pointer updates for strided memory addressing with constant array offsets:

  f0 = *r8; r8 += 4;
  f1 = *r8; r8 += 4;
  f2 = *r8; r8 += 4;

  f0 = r8[0];
  f1 = r8[4];
  f2 = r8[8];
  r8 += 12;

Some compilers are better at figuring this out than others
Some systems may go faster with option #1, some others with option #2!
40
Bottom line
Know your compilers
  Some are great, some are not so great
  Some will not do things that you think they should do, often because you forget about things like aliasing
There is no golden rule, because there are some system-dependent behaviors
  Although the general principles typically hold
Doing all optimization by hand is a bad idea in general
41
By-hand Optimization is bad?
Hand-optimized version: turn array accesses into pointer dereferences, and assign to each element of c just once:

  for(i = 0; i < SIZE; i++) {
    int *orig_pa = &a[i][0];
    for(j = 0; j < SIZE; j++) {
      int *pa = orig_pa;
      int *pb = &b[0][j];
      int sum = 0;
      for(k = 0; k < SIZE; k++) {
        sum += *pa * *pb;
        pa++;
        pb += SIZE;
      }
      c[i][j] = sum;
    }
  }

Simple version:

  for(i = 0; i < SIZE; i++) {
    for(j = 0; j < SIZE; j++) {
      for(k = 0; k < SIZE; k++) {
        c[i][j] += a[i][k]*b[k][j];
      }
    }
  }
42
Results (Courtesy of CMU)

  R10000       Simple   Optimized
  cc -O0       34.7s    27.4s
  cc -O3        5.3s     8.0s
  egcc -O9     10.1s     8.3s

  21164        Simple   Optimized
  cc -O0       40.5s    12.2s
  cc -O5       16.7s    18.6s
  egcc -O0     27.2s    19.5s
  egcc -O9     12.3s    14.7s

  Pentium II   Simple   Optimized
  egcc -O9     28.4s    25.3s

  RS/6000      Simple   Optimized
  xlC -O3      63.9s    65.3s
43
Why is Simple Sometimes Better?
Easier for humans and the compiler to understand
  The more the compiler knows, the more it can do
Pointers are hard to analyze; arrays are easier
You never know how fast code will run until you time it
The transformations we did by hand, good optimizers will often do for us
  And they will often do a better job than we can
Pointers may cause aliases and data dependences where the array code had none
44
Bottom Line
How should I write my programs, given that I have a good, optimizing compiler?
Don't: smash code into oblivion
  Hard to read, maintain, and ensure correctness
Do:
  Select the best algorithm
  Write code that's readable and maintainable (procedures, recursion, without built-in constant limits), even though these factors can slow down code
  Eliminate optimization blockers, so the compiler can do its job
  Account for cache behavior
  Focus on inner loops, and use a profiler to find the important ones!
45
Memory
One constant issue that compilers unfortunately do not deal with very well is memory and locality
  Although some recent compilers have gotten pretty smart about it
Let's look at this in detail, because the ideas apply strongly to high performance for concurrent programs
  No point in writing a concurrent program if its sequential components are egregiously suboptimal
46
The Memory Hierarchy
[Figure: the memory hierarchy, from the CPU outward; each level is larger, slower, and cheaper than the one before:
  registers          - sub-ns register reference
  L1 cache (SRAM)    - 1-2 cycles per reference
  L2 cache (SRAM)    - ~10 cycles per reference
  L3 cache (DRAM)    - ~20 cycles per reference
  memory (DRAM)      - hundreds of cycles per reference
  disk               - tens of thousands of cycles per reference]
Spatial locality: having accessed a location, a nearby location is likely to be accessed next
  Therefore, if one can bring contiguous data items "close" to the processor at once, then perhaps a sequence of instructions will find them ready for use
Temporal locality: having accessed a location, this location is likely to be accessed again
  Therefore, if one can keep recently accessed data items "close" to the processor, then perhaps the next instructions will find them ready for use
Numbers roughly based on 2005 Intel P4 processors with multi GHz clock rates
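A tiny illustration (an assumed example, not from the slides) of both kinds of locality in one loop:

  /* Illustrative sketch: spatial vs. temporal locality. */
  double sum_and_scale(const double *a, int n, double s) {
      double sum = 0.0;            /* 'sum' and 's' are reused every iteration:
                                      temporal locality                        */
      for (int i = 0; i < n; i++)
          sum += s * a[i];         /* a[0], a[1], ... are consecutive in memory:
                                      spatial locality                          */
      return sum;
  }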
47
Caches
There are many issues regarding cache design
  Direct-mapped vs. associative
  Write-through vs. write-back
  How many levels, etc.
But this belongs in a computer architecture class
Question: Why should the programmer care?
Answer: Because code can be re-arranged to improve locality, and thus to improve performance
48
Example #1: 2-D Array Initialization

  Option #1 (i,j):                 Option #2 (j,i):
  int a[200][200];                 int a[200][200];
  for (i=0;i<200;i++) {            for (j=0;j<200;j++) {
    for (j=0;j<200;j++) {            for (i=0;i<200;i++) {
      a[i][j] = 2;                     a[i][j] = 2;
    }                                }
  }                                }

Which alternative is best? i,j? j,i?
To answer this, one must understand the memory layout of a 2-D array
49
Row-Major
C uses Row-Major

First option:
  int a[200][200];
  for (i=0;i<200;i++)
    for (j=0;j<200;j++)
      a[i][j]=2;

Second option:
  int a[200][200];
  for (j=0;j<200;j++)
    for (i=0;i<200;i++)
      a[i][j]=2;
50
Counting cache misses
n x n 2-D array, element size = e bytes, cache line size = b bytes

Traversal in row order (along memory lines):
  One cache miss for every cache line: n² * e / b misses
  Total number of memory accesses: n²
  Miss rate: e/b
  Example: miss rate = 4 bytes / 64 bytes = 6.25%
  (Unless the array is very small)

Traversal in column order (across memory lines):
  One cache miss for every access
  Example: miss rate = 100%
  (Unless the array is very small)
51
Array Initialization in C

First option (good locality):
  int a[200][200];
  for (i=0;i<200;i++)
    for (j=0;j<200;j++)
      a[i][j]=2;

Second option (bad locality):
  int a[200][200];
  for (j=0;j<200;j++)
    for (i=0;i<200;i++)
      a[i][j]=2;
52
Performance Measurements

Option #1:
  int a[X][X];
  for (i=0;i<X;i++)
    for (j=0;j<X;j++)
      a[i][j]=2;

Option #2:
  int a[X][X];
  for (j=0;j<X;j++)
    for (i=0;i<X;i++)
      a[i][j]=2;

Experiments on my laptop:
[Plot: execution time vs. 2-D array dimension (200 to 1200) for Option #1 and Option #2]
Note that other languages use column major e.g., FORTRAN
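A rough sketch of how one could time the two options (an assumed harness, not from the slides; clock() granularity and compiler flags matter a lot in practice):

  #include <stdio.h>
  #include <time.h>

  #define X 1000
  static int a[X][X];

  /* Micro-benchmark sketch for the two traversal orders. */
  int main(void) {
      clock_t t0 = clock();
      for (int i = 0; i < X; i++)      /* option #1: row order (good locality in C) */
          for (int j = 0; j < X; j++)
              a[i][j] = 2;
      clock_t t1 = clock();
      for (int j = 0; j < X; j++)      /* option #2: column order (strided access)  */
          for (int i = 0; i < X; i++)
              a[i][j] = 2;
      clock_t t2 = clock();
      printf("row order:    %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
      printf("column order: %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
      return 0;
  }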
53
Matrix Multiplication
The previous example was very simple, but things can get more complicated very quickly
Let's look at a simple program to multiply two square matrices
A fundamental operation in linear algebra:
  Solving linear systems
  Computing the transitive closure of a graph
  etc.
Probably the most well-studied problem in HPC
  Clever algorithms
  Clever implementations
54
Matrix Multiplication
A = [a_ij], B = [b_ij], C = A x B = [c_ij], for i,j = 1,...,N

  c_ij = a_i1*b_1j + a_i2*b_2j + ... + a_iN*b_Nj

i.e., element c_ij is the dot product of row i of A and column j of B
Like most linear algebra operations, this formula can be translated into a very simple computer program that just "follows" the math
55
Matrix Multiplication Algorithm
All matrices are stored in 2-D arrays of dimension NxN:

  int i,j,k;
  double a[N][N], b[N][N], c[N][N];
  ... initialization of a and b ...
  for (i=0;i<N;i++)
    for (j=0;j<N;j++) {
      c[i][j] = 0.0;
      for (k=0;k<N;k++) {
        c[i][j] += a[i][k] * b[k][j];
      }
    }
56
How good is this algorithm?
This algorithm is good because:
  It takes only a few lines
  It is a direct mapping of the formula for c_ij
  Anybody should be able to understand what it does by just looking at it
  It almost certainly has no bug because it is so simple
This algorithm is bad because:
  It has terrible performance, because it ignores the fact that the underlying computer has a memory hierarchy
57
First Performance Improvement

  for (i=0;i<N;i++)
    for (j=0;j<N;j++) {
      c[i][j] = 0.0;
      for (k=0;k<N;k++) {
        c[i][j] += a[i][k] * b[k][j];
      }
    }

Note: we assume the compiler will remove c[i][j] from the inner loop, unroll loops, etc.
First idea: switching loops around?
  After all, it worked for array initialization
58
Loop permutations in MatMul

  for (i=0;i<N;i++)
    for (j=0;j<N;j++)
      for (k=0;k<N;k++)
        c[i][j] += a[i][k] * b[k][j];

There are 6 possible orders for the three loops
  i-j-k, i-k-j, j-i-k, j-k-i, k-i-j, k-j-i
Each order corresponds to a different access pattern for the matrices
Let's focus on the inner loop, as it is the one that's executed most often
59
Inner Loop Memory Accesses
Each matrix element can be accessed in three modes in the inner loop:
  Constant: doesn't depend on the inner loop's index
  Sequential: contiguous addresses
  Strided: non-contiguous addresses (N elements apart)

For c[i][j] += a[i][k] * b[k][j]:

  Order    c[i][j]      a[i][k]      b[k][j]
  i-j-k    Constant     Sequential   Strided
  i-k-j    Sequential   Constant     Sequential
  j-i-k    Constant     Sequential   Strided
  j-k-i    Strided      Strided      Constant
  k-i-j    Sequential   Constant     Sequential
  k-j-i    Strided      Strided      Constant
60
Loop order and Performance
Constant access is better than sequential access
  It's always good to have constants in loops because they can be put in registers (as we've seen in our very first optimization)
Sequential access is better than strided access
  Because sequential access utilizes the cache better
Let's go back to the previous slide
61
Best Loop Ordering?
For c[i][j] += a[i][k] * b[k][j]:

  Order    c[i][j]      a[i][k]      b[k][j]
  i-j-k    Constant     Sequential   Strided
  i-k-j    Sequential   Constant     Sequential
  j-i-k    Constant     Sequential   Strided
  j-k-i    Strided      Strided      Constant
  k-i-j    Sequential   Constant     Sequential
  k-j-i    Strided      Strided      Constant

k-i-j and i-k-j should have the best performance
i-j-k and j-i-k should be worse
j-k-i and k-j-i should be the worst
You will measure this in the first (warm-up) assignment
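For reference, the i-k-j version might be written as follows (a sketch; it assumes c has been zeroed beforehand):

  /* i-k-j ordering: a[i][k] is constant in the inner loop (kept in a register),
     and both c[i][j] and b[k][j] are accessed sequentially along rows.
     Assumes c[][] has already been initialized to zero. */
  void matmul_ikj(int n, double a[n][n], double b[n][n], double c[n][n]) {
      for (int i = 0; i < n; i++)
          for (int k = 0; k < n; k++) {
              double aik = a[i][k];           /* constant in the inner loop */
              for (int j = 0; j < n; j++)
                  c[i][j] += aik * b[k][j];
          }
  }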
62
How good is the best ordering?
How many cache misses does this style of implementation incur? Let us count them for the i-j-k ordering:

  for (i=0;i<N;i++)
    for (j=0;j<N;j++) {
      sum = 0;
      for (k=0;k<N;k++)
        sum += a[i][k]*b[k][j];
      c[i][j] = sum;
    }

Clearly this is not easy to compute exactly
  e.g., if the matrix is twice the size of the cache, there is a lot of loading/evicting, and obtaining a formula would be complicated
Let L be the cache line size in number of matrix elements
How about a very coarse approximation, assuming that the matrix is much larger than the cache?
  Determine what matrix pieces are loaded/written
  Figure out the expected number of cache misses
63
Slow Memory Operations

  for (i=0;i<N;i++)       // (1) read row i of a into cache
                          // (2) write row i of c back to memory
    for (j=0;j<N;j++)     // (3) read column j of b into cache
      for (k=0;k<N;k++)
        c[i][j] += a[i][k]*b[k][j];

L: cache line size (in elements)
  (1): N * (N / L) cache misses
  (2): N * (N / L) cache misses
  (3): N * N * N cache misses
Although the accesses to B proceed element by element down a column, the matrix is stored in row-major fashion, so every access to B misses!
Total: 2N²/L + N³ ≈ N³ (for large N)
64
Bad News
≈ N³ slow memory operations and 2N³ arithmetic operations
Ratio ops / mem ≈ 2
This is bad news because we know that computer architectures are NOT balanced: memory operations are orders of magnitude slower than arithmetic operations
Therefore, memory is still the bottleneck for this implementation of matrix multiplication (the ratio should be much higher)
BUT: we have only N² matrix elements, so how come we perform N³ slow memory accesses?
  Because we access matrix B very inefficiently, trying to load entire columns one after the other
Lesson: counting the number of operations and comparing it with the size of the data is not sufficient to ascertain that an algorithm will not suffer from the memory bottleneck
65
Better cache reuse?
Since we really need only N² elements, perhaps there is a better way to reorganize the operations of the matrix multiplication for a higher number of cache hits
  Possible because '+' and '*' are associative and commutative
Researchers have spent a lot of time trying to find the best ordering; there are even theorems!
Let q = ratio of operations to slow memory accesses
  q must be as high as possible to remove the memory bottleneck
[Hong & Kung 1981] Any reorganization of the algorithm is limited to q = O(√M), where M is the size of the cache (in number of elements)
  Obtained with a lot of unrealistic assumptions about the cache
  Still shows that q won't scale with N, unlike what one may think when dividing 2N³ by N²
66
“Blocked” Matrix Multiplication
One problem with our implementation is that we try to access entire columns of matrix B.
What about accessing only a subset of a column, or of multiple columns, at a time?
67
“Blocked” Matrix Multiplication
[Figure: matrices C, A, and B, with row i of C and A and column j of B highlighted, and the cache lines covering column j of B shown]
Key idea: reuse the other elements in each cache line as much as possible
68
“Blocked” Matrix Multiplication
[Figure: matrices C, A, and B; each cache line brought in for column j of B spans b elements of a row of B, so it also contains elements of the neighboring columns]
May as well compute c_i,j+1, since one loads column j+1 of B into the cache lines anyway. But one must reorder the operations as follows:
  compute the first b terms of c_ij, compute the first b terms of c_i,j+1
  compute the next b terms of c_ij, compute the next b terms of c_i,j+1
  ...
69
“Blocked” Matrix Multiplication
[Figure: matrices C, A, and B; the cache lines loaded for a group of columns of B can serve a whole sub-row of C]
May as well compute a whole sub-row of C, with the same reordering of the operations. But by computing a whole row of C, one has to load all columns of B, which one has to do again for computing the next row of C.
Idea: reuse the blocks of B that we have just loaded.
70
“Blocked” Matrix Multiplication
[Figure: matrices C, A, and B; a b x b block of C is computed from a block row of A and a block column of B]
Order of the operations:
  Compute the first b terms of all c_ij values in the C block
  Compute the next b terms of all c_ij values in the C block
  ...
  Compute the last b terms of all c_ij values in the C block
71
“Blocked” Matrix Multiplication
The matrices are partitioned into 4 x 4 grids of b x b blocks (N = 4*b):

  C11 C12 C13 C14      A11 A12 A13 A14      B11 B12 B13 B14
  C21 C22 C23 C24      A21 A22 A23 A24      B21 B22 B23 B24
  C31 C32 C33 C34      A31 A32 A33 A34      B31 B32 B33 B34
  C41 C42 C43 C44      A41 A42 A43 A44      B41 B42 B43 B44

For example:
  C22 = A21*B12 + A22*B22 + A23*B32 + A24*B42
  (4 matrix multiplications and 4 matrix additions)

Main point: each multiplication operates on small "block" matrices, whose size may be chosen so that they fit in the cache.
72
Blocked Algorithm
The blocked version of the i-j-k algorithm is written simply as

  for (i=0;i<N/b;i++)
    for (j=0;j<N/b;j++)
      for (k=0;k<N/b;k++)
        C[i][j] += A[i][k]*B[k][j]

where b is the block size (which we assume divides N),
where X[i][j] denotes the block of matrix X on block row i and block column j,
where "+=" means matrix addition, and "*" means matrix multiplication
(see the concrete sketch below)
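A concrete, non-optimized sketch of this blocked algorithm with the per-block loops written out (an illustration, not the official version; it assumes the block size divides N, c has been zeroed, and the function name is made up):

  /* Blocked matrix multiplication, block size b (assumed to divide n).
     (bi, bj, bk) walk over blocks; (i, j, k) walk inside a block.
     Assumes c[][] has already been initialized to zero. */
  void blocked_matmul(int n, int b, double a[n][n], double bm[n][n], double c[n][n]) {
      for (int bi = 0; bi < n; bi += b)
          for (int bj = 0; bj < n; bj += b)
              for (int bk = 0; bk < n; bk += b)
                  /* C block (bi,bj) += A block (bi,bk) * B block (bk,bj) */
                  for (int i = bi; i < bi + b; i++)
                      for (int k = bk; k < bk + b; k++) {
                          double aik = a[i][k];
                          for (int j = bj; j < bj + b; j++)
                              c[i][j] += aik * bm[k][j];
                      }
  }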
73
Cache Misses?

  for (i=0;i<N/b;i++)
    for (j=0;j<N/b;j++)
      // (1) write block C[i][j] to memory
      for (k=0;k<N/b;k++)
        // (2) load block A[i][k] from memory
        // (3) load block B[k][j] from memory
        C[i][j] += A[i][k]*B[k][j]

Counting elements transferred between slow memory and cache:
  (1): (N/b)*(N/b)*b*b
  (2): (N/b)*(N/b)*(N/b)*b*b
  (3): (N/b)*(N/b)*(N/b)*b*b
Total: N² + 2N³/b ≈ 2N³/b
74
Performance?
Slow memory accesses ≈ 2N³/b; number of operations = 2N³
Therefore, ratio ops / mem ≈ b
This ratio should be as high as possible
  (Compare to the value of 2 that we obtained with the non-blocked implementation)
This implies that one should make the block size as large as possible
But if we take this result to the extreme, then the block size should be equal to N!!
  This clearly doesn't make sense, because then we're back to the non-blocked implementation
75
Maximum Block Size
The blocking optimization only works if the blocks fit in the cache
  That is, 3 blocks of size b x b must fit in the cache (for A, B, and C)
Let M be the cache size (in elements)
We must have: 3b² ≤ M, or b ≤ √(M/3)
Therefore, in the best case, the ratio of the number of operations to slow memory accesses is √(M/3)
76
Necessary cache size
Therefore, given a machine with some ratio of arithmetic operation speed to slow memory speed, one can compute the cache size necessary to run blocked matrix multiplication so that the processor never waits for memory:

  Machine       Necessary cache size (KB)
  Ultra 2i      14.8
  Ultra 3        4.7
  Pentium 3      0.9
  Pentium 3M     2.4
  Power3         1.8
  Power4         5.4
  Itanium1      31.1
  Itanium2       0.7
77
TGE Case Study
  C00 = C00 + A00*B00;   C00 = C00 + A01*B10
  C01 = C01 + A00*B01;   C01 = C01 + A01*B11
  C10 = C10 + A10*B00;   C10 = C10 + A11*B10
  C11 = C11 + A10*B01;   C11 = C11 + A11*B11
78
Grid Superscalar
79
TRAP/J
80
Integration
1. We start with the original sequential code
2. Grid Superscalar: stubs and skeletons
3. TRAP/J: delegate and wrapper class
4. Compilation and binding of the delegate and the original application
5. A grid-enabled application is obtained!
81
Back on Our Case Study…
82
Results
83
Strassen's Matrix Multiplication
The traditional algorithm (with or without blocking) takes O(n^3) flops
Strassen discovered an algorithm with asymptotically lower flops: O(n^2.81)
Consider a 2x2 matrix multiply
  Normally it takes 8 multiplies and 4 adds
  Strassen does it with 7 multiplies and 18 adds: fewer multiplies, which is what matters when the "entries" are themselves matrices and the algorithm is applied recursively

Let

  M = | m11 m12 | = | a11 a12 | * | b11 b12 |
      | m21 m22 |   | a21 a22 |   | b21 b22 |

Let
  p1 = (a12 - a22) * (b21 + b22)      p5 = a11 * (b12 - b22)
  p2 = (a11 + a22) * (b11 + b22)      p6 = a22 * (b21 - b11)
  p3 = (a11 - a21) * (b11 + b12)      p7 = (a21 + a22) * b11
  p4 = (a11 + a12) * b22

Then
  m11 = p1 + p2 - p4 + p6
  m12 = p4 + p5
  m21 = p6 + p7
  m22 = p2 - p3 + p5 - p7
Extends to nxn by divide&conquer
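A direct transcription of the 2x2 case (a sketch; in the recursive n x n version each product pN becomes a recursive Strassen multiply of (n/2)x(n/2) blocks, and + / - become matrix addition and subtraction):

  /* Strassen's seven products for a 2x2 multiply M = A * B (scalar entries). */
  void strassen_2x2(const double A[2][2], const double B[2][2], double M[2][2]) {
      double p1 = (A[0][1] - A[1][1]) * (B[1][0] + B[1][1]);
      double p2 = (A[0][0] + A[1][1]) * (B[0][0] + B[1][1]);
      double p3 = (A[0][0] - A[1][0]) * (B[0][0] + B[0][1]);
      double p4 = (A[0][0] + A[0][1]) * B[1][1];
      double p5 = A[0][0] * (B[0][1] - B[1][1]);
      double p6 = A[1][1] * (B[1][0] - B[0][0]);
      double p7 = (A[1][0] + A[1][1]) * B[0][0];
      M[0][0] = p1 + p2 - p4 + p6;
      M[0][1] = p4 + p5;
      M[1][0] = p6 + p7;
      M[1][1] = p2 - p3 + p5 - p7;
  }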
84
Overall Lessons
Truly understanding application/cache behavior is tricky
But approximations can be obtained, with which one can reason about how to improve an implementation
The notion of blocking recurs in many algorithms and applications
You'll write a blocked matrix multiplication in a Programming Assignment
  Not hard, but tricky with indices and loops
85
Automatic Program Generation
It is difficult to optimize code because:
  There are many possible options for tuning/modifying the code
  These options interact in complex ways with the compiler and the hardware
This is really an "optimization problem"
  The objective function is the code's performance
  The feasible solutions are all possible ways to implement the software
    Typically a finite number of implementation decisions are to be made
    Each decision can take a range of values
    e.g., the 7th loop in the 3rd function can be unrolled 1, 2, ..., 20 times
    e.g., the "block size" could be 2x2, 4x4, ..., 400x400
    e.g., a function could be recursive or iterative
And one needs to do it again and again for different platforms
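As a toy illustration of this "optimization problem" view, one could brute-force a few block sizes and keep the fastest. This is a hypothetical sketch (the function names are made up; it assumes a blocked_matmul like the earlier sketch and uses clock() for timing):

  #include <stdio.h>
  #include <time.h>

  void blocked_matmul(int n, int b, double a[n][n], double bm[n][n], double c[n][n]);

  /* Brute-force auto-tuning sketch: try a handful of candidate block sizes
     and keep whichever runs fastest. The contents of c are not reset between
     trials; only the timing matters here. */
  int pick_block_size(int n, double a[n][n], double bm[n][n], double c[n][n]) {
      int candidates[] = { 8, 16, 32, 64, 128 };
      int best_b = 0;                            /* 0 means "none tried yet" */
      double best_t = 1e30;
      for (int i = 0; i < 5; i++) {
          int b = candidates[i];
          if (n % b != 0) continue;              /* keep the "b divides N" assumption */
          clock_t t0 = clock();
          blocked_matmul(n, b, a, bm, c);
          double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
          if (t < best_t) { best_t = t; best_b = b; }
      }
      printf("best block size: %d (%.3f s)\n", best_b, best_t);
      return best_b;
  }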
86
Automatic Program Generation
What is good at solving hard optimization problems? Computers!
Therefore, a computer program could generate the computer program with the best performance
  Could use a brute-force approach: try all possible solutions
    But there is an exponential number of them
  Could use genetic algorithms
  Could use some ad-hoc optimization technique
87
Matrix Multiplication
We have seen that for matrix multiplication there are several possible ways to optimize the code:
  Block size
  Optimization flag to the compiler
  Order of loops
  ...
It is difficult to find the best one
People have written automatic matrix multiplication program generators!
88
The ATLAS Project
ATLAS is a software package that you can download and run on most platforms
It runs for a while (perhaps a couple of hours) and generates a .c file that implements matrix multiplication!
ATLAS optimizes for:
  Instruction cache reuse
  Floating point instruction ordering (pipelined functional units)
  Reducing loop overhead
  Exposing parallelism (multiple functional units)
  Cache reuse
89
ATLAS (500x500 matrices)
ATLAS is faster than all other portable BLAS implementations and it is comparable with machine-specific libraries provided by the vendor.
[Plot: MFLOPS achieved on various architectures by Vendor BLAS, ATLAS BLAS, and reference F77 BLAS]
Source: Jack Dongarra
90
Conclusions
Programming for performance means working with the compiler
  Knowing its limitations
  Knowing its capabilities if it is unhindered
Finding the optimal code is really difficult, but dealing with locality is paramount
Automatic approaches for generating the code have been very successful in some cases