1
High-Performance Grid Computing and Research Networking
Presented by Juan Carlos Martinez
Instructor: S. Masoud Sadjadi
http://www.cs.fiu.edu/~sadjadi/Teaching/
sadjadi At cs Dot fiu Dot edu
High-Performance Sequential Programming
2
Acknowledgements
The content of many of the slides in these lecture notes has been adapted from the online resources prepared previously by the people listed below. Many thanks!
Henri Casanova Principles of High Performance Computing http://navet.ics.hawaii.edu/~casanova [email protected]
3
Sequential Programs
In this class we're mostly focusing on concurrent programs, but it's useful to recall some simple notions of high performance for sequential programs:
  Because some fundamental techniques are meaningful for concurrent programs
  Because in your projects you'll have to get code to go fast, and a concurrent program is just simultaneous sequential programs
We'll look at:
  Standard code optimization techniques
  Optimizations dealing with memory issues
4
Loop Constants
Identifying loop constants:

  for (k=0;k<N;k++) {
    c[i][j] += a[i][k] * b[k][j];
  }

Here c[i][j] does not depend on k, so it can be kept in a scalar:

  sum = 0;
  for (k=0;k<N;k++) {
    sum += a[i][k] * b[k][j];
  }
  c[i][j] = sum;
5
Multi-dimensional Array Accesses
A static 2-D array is one declared as <type> <name>[<size>][<size>], e.g.:
  int myarray[10][30];
The elements of a 2-D array are stored in contiguous memory cells
The problem is that:
  The array is 2-D, conceptually
  Computer memory is 1-D
1-D computer memory: a memory location is described by a single number, its address (just like a single axis)
Therefore, there must be a mapping from 2-D to 1-D: from a 2-D abstraction to a 1-D implementation
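For illustration, here is a hand-rolled version of such a mapping (not from the slides; the struct and helper names are made up). It backs a conceptual 2-D matrix with a single 1-D block of memory and computes the 1-D offset explicitly:

  #include <stdlib.h>

  /* Hypothetical by-hand 2-D array: a rows x cols matrix stored in one 1-D block. */
  typedef struct { int rows, cols; double *data; } Matrix2D;

  /* Row-major mapping: element (i,j) lives at offset i*cols + j. */
  double get(const Matrix2D *m, int i, int j)      { return m->data[i * m->cols + j]; }
  void   set(Matrix2D *m, int i, int j, double v)  { m->data[i * m->cols + j] = v; }

  Matrix2D make_matrix(int rows, int cols) {
      Matrix2D m = { rows, cols, calloc((size_t)rows * cols, sizeof(double)) };
      return m;
  }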
6
Mapping from 2-D to 1-D?
[Figure: an n x n 2-D array laid out in 1-D computer memory; one possible 2-D to 1-D mapping, and another one]
There are (n²)! possible mappings
7
Row-Major, Column-Major
Luckily, only 2 of the (n²)! mappings are ever implemented in a language:
  Row-Major: rows are stored contiguously (1st row, 2nd row, 3rd row, 4th row, ...)
  Column-Major: columns are stored contiguously (1st column, 2nd column, 3rd column, 4th column, ...)
8
Row-Major
C uses Row-Major
[Figure: as addresses increase, the rows of the matrix are stored one after another across consecutive memory/cache lines; matrix elements are stored in contiguous memory lines]
9
Column-Major
FORTRAN uses Column-Major
[Figure: as addresses increase, the columns of the matrix are stored one after another across consecutive memory/cache lines; matrix elements are stored in contiguous memory lines]
10
Address Computation
For an M x N row-major array:
  @(a[i][j]) = @(a[0][0]) + i*N + j
  (Detail: there should be a sizeof() factor as well)
Example with N = 6:
  @(a[2][3]) = @(a[0][0]) + 2*6 + 3 = @(a[0][0]) + 15
For column-major (like in FORTRAN), the formula is reversed:
  @(a[i][j]) = @(a[0][0]) + j*M + i
or, with 1-based indexing:
  @(a[i][j]) = @(a[1][1]) + (j-1)*M + (i-1)
[Figure: an M x N row-major array; element a[i][j] sits i*N + j elements past @(a[0][0])]
11
Array Accesses are Expensive
Given that the formula is @(a[i][j]) = @(a[0][0]) + i*N + j, each array access entails 2 additions and 1 multiplication
This is even higher for higher-dimensional arrays
Therefore, when the compiler compiles the instruction
  sum += a[i][k] * b[k][j];
4 integer additions and 2 integer multiplications are generated just to compute addresses!
And then 1 fp multiplication and 1 fp addition
If the bottleneck is memory, then we don't care
But if the processor is not starved for data (which we will see is possible for this application), then the overhead of computing addresses is large
12
Removing Array Accesses
Replace array accesses by pointer dereferences:

  for (j=0;j<N;j++)
    a[i][j] = 2;             // 2*N adds, N multiplies

  double *ptr = &(a[i][0]);  // 2 adds, 1 multiply
  for (j=0;j<N;j++) {
    *ptr = 2;
    ptr++;                   // N integer additions
  }
13
Loop Unrolling
Loop unrolling:

  for (i=0;i<100;i++)   // 100 comparisons
    a[i] = i;

  i=0;
  do {
    a[i] = i; i++;
    a[i] = i; i++;
    a[i] = i; i++;
    a[i] = i; i++;
  } while (i<100);      // 25 comparisons
14
Loop Unrolling
One can unroll a loop by more (or less) than 4-fold
If the unrolling factor does not divide the number of iterations, then one must handle the few leftover iterations separately, e.g., before the loop (see the sketch below)
Trade-off: performance gain vs. code size
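A sketch of a 4-fold unrolling with a cleanup loop for the leftover iterations (not from the slides; the fill function is just an example):

  /* Unroll by 4 when the trip count n need not be a multiple of 4.
     The first n % 4 iterations are peeled off before the unrolled loop,
     as suggested above (doing them after the loop works equally well). */
  void fill(int *a, int n) {
      int i = 0;
      for (; i < n % 4; i++)       /* leftover iterations */
          a[i] = i;
      for (; i < n; i += 4) {      /* main body: remaining count is divisible by 4 */
          a[i]   = i;
          a[i+1] = i + 1;
          a[i+2] = i + 2;
          a[i+3] = i + 3;
      }
  }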
15
Code Motion

  sum = 0;
  for (i = 0; i <= fact(n); i++)
    sum += i;

  sum = 0;
  f = fact(n);
  for (i = 0; i <= f; i++)
    sum += i;
16
Inlining
Inlining:

  for (i=0;i<N;i++)
    sum += cube(i);
  ...
  int cube(int i) { return (i*i*i); }

becomes

  for (i=0;i<N;i++)
    sum += i*i*i;
17
Common Sub-expression
Common sub-expression elimination:

  x = a + b - c;
  y = a + d + e + b;

  tmp = a + b;
  x = tmp - c;
  y = tmp + d + e;
18
Dead Code
Dead code elimination:

  x = 12;
  ...
  x = a+c;

becomes

  ...
  x = a+c;

Seems obvious, but may be "hidden":

  int x = 0;
  ...
  #ifdef FOO
    x = f(3);
  #else
19
Other Techniques
Strength reduction:
  a = i*3;     becomes     a = i+i+i;
Constant propagation:
  int speedup = 3;
  efficiency = 100 * speedup / numprocs;
  x = efficiency * 2;
becomes
  x = 600 / numprocs;
20
So where are we?
We have seen a few optimization techniques, but there are many others!
We could apply them all to the code, but this would result in completely unreadable/undebuggable code
Fortunately, the compiler should come to the rescue
  To some extent, at least
  Some compilers can do a lot for you, some not so much
  Typically, compilers provided by a vendor can do pretty tricky optimizations
21
What do compilers do?
All modern compilers perform some automatic optimization when generating code
  In fact, you implement some of those in a graduate-level compiler class, and sometimes at the undergraduate level
Most compilers provide several levels of optimization
  -O0: no optimization (in fact some is always done)
  -O1, -O2, ..., -OX
The higher the optimization level, the higher the probability that a debugger will have trouble dealing with the code
  Always debug with -O0
  Some compilers enforce that -g means -O0
Some compilers will flat out tell you that higher levels of optimization may break some code!
22
Compiler optimizations
In this class we use gcc, which is free and pretty good
  -Os: optimize for size (some optimizations increase code size tremendously)
  Do a "man gcc" and look at the many optimization options: one can pick and choose, or just use standard sets via -O1, -O2, etc.
The fanciest compilers are typically the ones provided by vendors
  You can't sell a good machine if it has a bad compiler
Compiler technology used to be really poor; languages also used to be designed without thinking of compilers (FORTRAN, Ada)
  No longer true: every language designer today has an in-depth understanding of compiler technology
23
What can compilers do?
Most of the techniques we've seen!
  Inlining
  Assignment of variables to registers (a difficult problem)
  Dead code elimination
  Algebraic simplification
  Moving invariant code out of loops
  Constant propagation
  Control flow simplification
  Instruction scheduling, reordering
  Strength reduction (e.g., add to pointers, rather than doing array index computation)
  Loop unrolling and software pipelining
  Dead store elimination
  and many others...
25
Instruction scheduling
Modern computers have multiple functional units that could be used in parallel
  Or at least ones that are pipelined: if fed operands at each cycle, they can produce a result at each cycle, although an individual computation may require 20 cycles
Instruction scheduling:
  Reorder the instructions of a program (e.g., at the assembly code level)
  Preserve correctness
  Make it possible to use functional units optimally
26
Instruction Scheduling
One cannot just shuffle all instructions around: preserving correctness means that data dependences are unchanged
Three types of data dependences (see the small example below):
  True dependence:    a = ...   followed by   ... = a
  Output dependence:  a = ...   followed by   a = ...
  Anti dependence:    ... = a   followed by   a = ...
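A small illustrative C snippet (an assumed example, not from the slides) showing all three dependence kinds on one variable:

  /* Illustrative only: the three kinds of data dependences on variable a. */
  int x, y, a;

  void deps(void) {
      a = x + 1;   /* write a                                                  */
      y = a * 2;   /* read a  -> true (flow) dependence on the write above     */
      a = y - 3;   /* write a -> anti dependence on the read above, and
                      output dependence on the first write                     */
  }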
27
Instruction Scheduling Example

  Before scheduling         After scheduling
  ADD  R1,R2,R4             ADD  R1,R2,R4
  ADD  R2,R2,1              LOAD R4,@2
  ADD  R3,R6,R2             ADD  R2,R2,1
  LOAD R4,@2                ADD  R3,R6,R2

Since loading from memory can take many cycles, one may as well do it as early as possible
The LOAD can't be moved any earlier because of the anti-dependence on R4 with the first ADD
28
Software Pipelining
Fancy name for "instruction scheduling for loops"
Can be done by a good compiler:
  First unroll the loop
  Then make sure that instructions can happen in parallel, i.e., "schedule" them on functional units
Let's see a simple example
29
Example
Source code:
  for (i=0;i<n;i++)
    sum += a[i];

Loop body in assembly:

  r1 = L r0
  ---            ; stall
  r2 = Add r2,r1
  r0 = Add r0,4

Unroll the loop and allocate registers (may be very difficult):

  r1 = L r0
  ---            ; stall
  r2 = Add r2,r1
  r0 = Add r0,12
  r4 = L r3
  ---            ; stall
  r2 = Add r2,r4
  r3 = Add r3,12
  r7 = L r6
  ---            ; stall
  r2 = Add r2,r7
  r6 = Add r6,12
  r10 = L r9
  ---            ; stall
  r2 = Add r2,r10
  r9 = Add r9,12
30
Example (cont.)
[Figure: the unrolled instructions scheduled to exploit instruction-level parallelism where possible; loads, adds, and pointer updates from different iterations are interleaved so that independent operations can issue together]
Schedule the unrolled instructions, exploiting instruction-level parallelism if possible
Identify the repeating pattern (the kernel)
31
Example (cont.)
The loop becomes a prologue (filling the pipeline), a kernel (the repeating scheduled pattern, executed many times), and an epilogue (draining the pipeline):
[Figure: the scheduled instruction stream with the prologue, kernel, and epilogue regions marked]
32
Software Pipelining
The "kernel" may require many registers, and it's nice to know how to use as few as possible
  Otherwise, one may have to go to cache more, which may negate the benefits of software pipelining
Dependency constraints must be respected
  May be very difficult to analyze for complex nested loops
Software pipelining with registers is a very well-known NP-hard problem
33
Limits to Compiler Optimization
Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles
  e.g., data ranges may be more limited than variable types suggest
  e.g., using an "int" in C for what could be an enumerated type
Most analysis is performed only within procedures
  Whole-program analysis is too expensive in most cases
Most analysis is based only on static information
  The compiler has difficulty anticipating run-time inputs
When in doubt, the compiler must be conservative
  It cannot perform an optimization if it changes program behavior under any realizable circumstance, even if circumstances seem quite bizarre and unlikely
34
Good practice
Writing code for high performance means working hand-in-hand with the compiler
#1: Optimize things that we know the compiler cannot deal with
  For instance, the "blocking" optimization for matrix multiplication may need to be done by hand
  But some compilers may find the best i-j-k ordering!
#2: Write code so that the compiler can do its optimizations
  Remove optimization blockers
35
Optimization blocker: aliasing
Aliasing: two pointers point to the same location
If a compiler can't tell what a pointer points at, it must assume it can point at almost anything
Example:

  void foo(int *q, int *p) {
    *q = 3;
    (*p)++;
    *q *= 4;
  }

cannot be safely optimized to:

  (*p)++;
  *q = 12;

because perhaps p == q
Some compilers have pretty fancy aliasing analysis capabilities
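One C99 tool for removing this blocker is the restrict qualifier; a sketch (assuming the programmer can actually guarantee that the pointers never alias):

  /* With restrict, the programmer promises that p and q never point to the
     same object, so the compiler is free to fold the two writes through q
     into a single *q = 12. */
  void foo(int * restrict q, int * restrict p) {
      *q = 3;
      (*p)++;
      *q *= 4;
  }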
37
Blocker: Function Call

  sum = 0;
  for (i = 0; i <= fact(n); i++)
    sum += i;

A compiler cannot optimize this because the function fact may have side effects
  e.g., it may modify global variables
The function may not return the same value for given arguments
  It may depend on other parts of the global state, which may be modified in the loop
Why doesn't the compiler look at the code for fact?
  The linker may overload it with a different version (unless it is declared static)
  Interprocedural optimization is not used extensively due to cost
  Inlining can achieve the same effect for small procedures
Again: the compiler treats a procedure call as a black box, which weakens optimizations in and around it
38
Other Techniques
Use more local variables:

  while( … ) {
    *res++ = filter[0]*signal[0]
           + filter[1]*signal[1]
           + filter[2]*signal[2];
    signal++;
  }

  register float f0 = filter[0];
  register float f1 = filter[1];
  register float f2 = filter[2];
  while( … ) {
    *res++ = f0*signal[0] + f1*signal[1] + f2*signal[2];
    signal++;
  }

Helps some compilers
39
Other Techniques
Replace pointer updates for strided memory addressing with constant array offsets:

  f0 = *r8; r8 += 4;
  f1 = *r8; r8 += 4;
  f2 = *r8; r8 += 4;

  f0 = r8[0];
  f1 = r8[4];
  f2 = r8[8];
  r8 += 12;

Some compilers are better at figuring this out than others
Some systems may go faster with option #1, some others with option #2!
40
Bottom line
Know your compilers
  Some are great, some are not so great
  Some will not do things that you think they should do, often because you forget about things like aliasing
There is no golden rule, because there are some system-dependent behaviors
  Although the general principles typically hold
Doing all optimization by hand is a bad idea in general
41
By-hand Optimization is bad?
Hand-optimized version: turn array accesses into pointer dereferences, and assign to each element of c just once:

  for(i = 0; i < SIZE; i++) {
    int *orig_pa = &a[i][0];
    for(j = 0; j < SIZE; j++) {
      int *pa = orig_pa;
      int *pb = &b[0][j];
      int sum = 0;
      for(k = 0; k < SIZE; k++) {
        sum += *pa * *pb;
        pa++;
        pb += SIZE;
      }
      c[i][j] = sum;
    }
  }

Simple version:

  for(i = 0; i < SIZE; i++) {
    for(j = 0; j < SIZE; j++) {
      for(k = 0; k < SIZE; k++) {
        c[i][j] += a[i][k]*b[k][j];
      }
    }
  }
42
Results (Courtesy of CMU)

  R10000       Simple   Optimized
  cc -O0       34.7s    27.4s
  cc -O3        5.3s     8.0s
  egcc -O9     10.1s     8.3s

  21164        Simple   Optimized
  cc -O0       40.5s    12.2s
  cc -O5       16.7s    18.6s
  egcc -O0     27.2s    19.5s
  egcc -O9     12.3s    14.7s

  Pentium II   Simple   Optimized
  egcc -O9     28.4s    25.3s

  RS/6000      Simple   Optimized
  xlC -O3      63.9s    65.3s
43
Why is Simple Sometimes Better?
Easier for humans and the compiler to understand
  The more the compiler knows, the more it can do
Pointers are hard to analyze; arrays are easier
You never know how fast code will run until you time it
The transformations we did by hand, good optimizers will often do for us
  And they will often do a better job than we can
Pointers may cause aliases and data dependences where the array code had none
44
Bottom Line
How should I write my programs, given that I have a good, optimizing compiler?
Don't: smash code into oblivion
  Hard to read, maintain, and ensure correctness
Do:
  Select the best algorithm
  Write code that's readable and maintainable (procedures, recursion, without built-in constant limits), even though these factors can slow down code
  Eliminate optimization blockers, so the compiler can do its job
  Account for cache behavior
  Focus on inner loops, and use a profiler to find the important ones!
45
Memory
One constant issue that compilers unfortunately do not deal with very well is memory and locality
  Although some recent compilers have gotten pretty smart about it
Let's look at this in detail, because the ideas apply strongly to high performance for concurrent programs
  No point in writing a concurrent program if its sequential components are egregiously suboptimal
46
The Memory Hierarchy
[Figure: the memory hierarchy, from the CPU outward; each level is larger, slower, and cheaper than the one before:
  registers          - sub-ns register reference
  L1 cache (SRAM)    - 1-2 cycles per reference
  L2 cache (SRAM)    - ~10 cycles per reference
  L3 cache (DRAM)    - ~20 cycles per reference
  memory (DRAM)      - hundreds of cycles per reference
  disk               - tens of thousands of cycles per reference]
Spatial locality: having accessed a location, a nearby location is likely to be accessed next
  Therefore, if one can bring contiguous data items "close" to the processor at once, then perhaps a sequence of instructions will find them ready for use
Temporal locality: having accessed a location, this location is likely to be accessed again
  Therefore, if one can keep recently accessed data items "close" to the processor, then perhaps the next instructions will find them ready for use
Numbers roughly based on 2005 Intel P4 processors with multi GHz clock rates
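A tiny illustration (an assumed example, not from the slides) of both kinds of locality in one loop:

  /* Illustrative sketch: spatial vs. temporal locality. */
  double sum_and_scale(const double *a, int n, double s) {
      double sum = 0.0;            /* 'sum' and 's' are reused every iteration:
                                      temporal locality                        */
      for (int i = 0; i < n; i++)
          sum += s * a[i];         /* a[0], a[1], ... are consecutive in memory:
                                      spatial locality                          */
      return sum;
  }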
47
Caches
There are many issues regarding cache design
  Direct-mapped vs. associative
  Write-through vs. write-back
  How many levels, etc.
But this belongs in a computer architecture class
Question: Why should the programmer care?
Answer: Because code can be re-arranged to improve locality, and thus to improve performance
48
Example #1: 2-D Array Initialization

  Option #1 (i,j):                 Option #2 (j,i):
  int a[200][200];                 int a[200][200];
  for (i=0;i<200;i++) {            for (j=0;j<200;j++) {
    for (j=0;j<200;j++) {            for (i=0;i<200;i++) {
      a[i][j] = 2;                     a[i][j] = 2;
    }                                }
  }                                }

Which alternative is best? i,j? j,i?
To answer this, one must understand the memory layout of a 2-D array
49
Row-Major
C uses Row-Major

First option:
  int a[200][200];
  for (i=0;i<200;i++)
    for (j=0;j<200;j++)
      a[i][j]=2;

Second option:
  int a[200][200];
  for (j=0;j<200;j++)
    for (i=0;i<200;i++)
      a[i][j]=2;
50
Counting cache misses
n x n 2-D array, element size = e bytes, cache line size = b bytes

Traversal in row order (along memory lines):
  One cache miss for every cache line: n² * e / b misses
  Total number of memory accesses: n²
  Miss rate: e/b
  Example: miss rate = 4 bytes / 64 bytes = 6.25%
  (Unless the array is very small)

Traversal in column order (across memory lines):
  One cache miss for every access
  Example: miss rate = 100%
  (Unless the array is very small)
51
Array Initialization in C

First option (good locality):
  int a[200][200];
  for (i=0;i<200;i++)
    for (j=0;j<200;j++)
      a[i][j]=2;

Second option (bad locality):
  int a[200][200];
  for (j=0;j<200;j++)
    for (i=0;i<200;i++)
      a[i][j]=2;
52
Performance Measurements

Option #1:
  int a[X][X];
  for (i=0;i<X;i++)
    for (j=0;j<X;j++)
      a[i][j]=2;

Option #2:
  int a[X][X];
  for (j=0;j<X;j++)
    for (i=0;i<X;i++)
      a[i][j]=2;

Experiments on my laptop:
[Plot: execution time vs. 2-D array dimension (200 to 1200) for Option #1 and Option #2]
Note that other languages use column major e.g., FORTRAN
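A rough sketch of how one could time the two options (an assumed harness, not from the slides; clock() granularity and compiler flags matter a lot in practice):

  #include <stdio.h>
  #include <time.h>

  #define X 1000
  static int a[X][X];

  /* Micro-benchmark sketch for the two traversal orders. */
  int main(void) {
      clock_t t0 = clock();
      for (int i = 0; i < X; i++)      /* option #1: row order (good locality in C) */
          for (int j = 0; j < X; j++)
              a[i][j] = 2;
      clock_t t1 = clock();
      for (int j = 0; j < X; j++)      /* option #2: column order (strided access)  */
          for (int i = 0; i < X; i++)
              a[i][j] = 2;
      clock_t t2 = clock();
      printf("row order:    %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
      printf("column order: %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
      return 0;
  }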
53
Matrix Multiplication
The previous example was very simple, but things can get more complicated very quickly
Let's look at a simple program to multiply two square matrices
A fundamental operation in linear algebra:
  Solving linear systems
  Computing the transitive closure of a graph
  etc.
Probably the most well-studied problem in HPC
  Clever algorithms
  Clever implementations
54
Matrix Multiplication
A = [a_ij], B = [b_ij], C = A x B = [c_ij], for i,j = 1,...,N

  c_ij = a_i1*b_1j + a_i2*b_2j + ... + a_iN*b_Nj

i.e., element c_ij is the dot product of row i of A and column j of B
Like most linear algebra operations, this formula can be translated into a very simple computer program that just "follows" the math
55
Matrix Multiplication Algorithm
All matrices are stored in 2-D arrays of dimension NxN:

  int i,j,k;
  double a[N][N], b[N][N], c[N][N];
  ... initialization of a and b ...
  for (i=0;i<N;i++)
    for (j=0;j<N;j++) {
      c[i][j] = 0.0;
      for (k=0;k<N;k++) {
        c[i][j] += a[i][k] * b[k][j];
      }
    }
56
How good is this algorithm?
This algorithm is good because:
  It takes only a few lines
  It is a direct mapping of the formula for c_ij
  Anybody should be able to understand what it does by just looking at it
  It almost certainly has no bug because it is so simple
This algorithm is bad because:
  It has terrible performance, because it ignores the fact that the underlying computer has a memory hierarchy
57
First Performance Improvement

  for (i=0;i<N;i++)
    for (j=0;j<N;j++) {
      c[i][j] = 0.0;
      for (k=0;k<N;k++) {
        c[i][j] += a[i][k] * b[k][j];
      }
    }

Note: we assume the compiler will remove c[i][j] from the inner loop, unroll loops, etc.
First idea: switching loops around?
  After all, it worked for array initialization
58
Loop permutations in MatMul

  for (i=0;i<N;i++)
    for (j=0;j<N;j++)
      for (k=0;k<N;k++)
        c[i][j] += a[i][k] * b[k][j];

There are 6 possible orders for the three loops
  i-j-k, i-k-j, j-i-k, j-k-i, k-i-j, k-j-i
Each order corresponds to a different access pattern for the matrices
Let's focus on the inner loop, as it is the one that's executed most often
59
Inner Loop Memory Accesses
Each matrix element can be accessed in three modes in the inner loop:
  Constant: doesn't depend on the inner loop's index
  Sequential: contiguous addresses
  Strided: non-contiguous addresses (N elements apart)

For c[i][j] += a[i][k] * b[k][j]:

  Order    c[i][j]      a[i][k]      b[k][j]
  i-j-k    Constant     Sequential   Strided
  i-k-j    Sequential   Constant     Sequential
  j-i-k    Constant     Sequential   Strided
  j-k-i    Strided      Strided      Constant
  k-i-j    Sequential   Constant     Sequential
  k-j-i    Strided      Strided      Constant
60
Loop order and Performance
Constant access is better than sequential access
  It's always good to have constants in loops because they can be put in registers (as we've seen in our very first optimization)
Sequential access is better than strided access
  Because sequential access utilizes the cache better
Let's go back to the previous slide
61
Best Loop Ordering?
For c[i][j] += a[i][k] * b[k][j]:

  Order    c[i][j]      a[i][k]      b[k][j]
  i-j-k    Constant     Sequential   Strided
  i-k-j    Sequential   Constant     Sequential
  j-i-k    Constant     Sequential   Strided
  j-k-i    Strided      Strided      Constant
  k-i-j    Sequential   Constant     Sequential
  k-j-i    Strided      Strided      Constant

k-i-j and i-k-j should have the best performance
i-j-k and j-i-k should be worse
j-k-i and k-j-i should be the worst
You will measure this in the first (warm-up) assignment
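For reference, the i-k-j version might be written as follows (a sketch; it assumes c has been zeroed beforehand):

  /* i-k-j ordering: a[i][k] is constant in the inner loop (kept in a register),
     and both c[i][j] and b[k][j] are accessed sequentially along rows.
     Assumes c[][] has already been initialized to zero. */
  void matmul_ikj(int n, double a[n][n], double b[n][n], double c[n][n]) {
      for (int i = 0; i < n; i++)
          for (int k = 0; k < n; k++) {
              double aik = a[i][k];           /* constant in the inner loop */
              for (int j = 0; j < n; j++)
                  c[i][j] += aik * b[k][j];
          }
  }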
62
How good is the best ordering?
How many cache misses does this style of implementation incur? Let us count them for the i-j-k ordering:

  for (i=0;i<N;i++)
    for (j=0;j<N;j++) {
      sum = 0;
      for (k=0;k<N;k++)
        sum += a[i][k]*b[k][j];
      c[i][j] = sum;
    }

Clearly this is not easy to compute exactly
  e.g., if the matrix is twice the size of the cache, there is a lot of loading/evicting, and obtaining a formula would be complicated
Let L be the cache line size in number of matrix elements
How about a very coarse approximation, assuming that the matrix is much larger than the cache?
  Determine what matrix pieces are loaded/written
  Figure out the expected number of cache misses
63
Slow Memory Operations

  for (i=0;i<N;i++)       // (1) read row i of a into cache
                          // (2) write row i of c back to memory
    for (j=0;j<N;j++)     // (3) read column j of b into cache
      for (k=0;k<N;k++)
        c[i][j] += a[i][k]*b[k][j];

L: cache line size (in elements)
  (1): N * (N / L) cache misses
  (2): N * (N / L) cache misses
  (3): N * N * N cache misses
Although the accesses to B proceed element by element down a column, the matrix is stored in row-major fashion, so every access to B misses!
Total: 2N²/L + N³ ≈ N³ (for large N)
64
Bad News
≈ N³ slow memory operations and 2N³ arithmetic operations
Ratio ops / mem ≈ 2
This is bad news because we know that computer architectures are NOT balanced: memory operations are orders of magnitude slower than arithmetic operations
Therefore, memory is still the bottleneck for this implementation of matrix multiplication (the ratio should be much higher)
BUT: we have only N² matrix elements, so how come we perform N³ slow memory accesses?
  Because we access matrix B very inefficiently, trying to load entire columns one after the other
Lesson: counting the number of operations and comparing it with the size of the data is not sufficient to ascertain that an algorithm will not suffer from the memory bottleneck
65
Better cache reuse?
Since we really need only N² elements, perhaps there is a better way to reorganize the operations of the matrix multiplication for a higher number of cache hits
  Possible because '+' and '*' are associative and commutative
Researchers have spent a lot of time trying to find the best ordering; there are even theorems!
Let q = ratio of operations to slow memory accesses
  q must be as high as possible to remove the memory bottleneck
[Hong & Kung 1981] Any reorganization of the algorithm is limited to q = O(√M), where M is the size of the cache (in number of elements)
  Obtained with a lot of unrealistic assumptions about the cache
  Still shows that q won't scale with N, unlike what one may think when dividing 2N³ by N²
66
“Blocked” Matrix Multiplication
One problem with our implementation is that we try to access entire columns of matrix B.
What about accessing only a subset of a column, or of multiple columns, at a time?
67
“Blocked” Matrix Multiplication
[Figure: matrices C, A, and B, with row i of C and A and column j of B highlighted, and the cache lines covering column j of B shown]
Key idea: reuse the other elements in each cache line as much as possible
68
“Blocked” Matrix Multiplication
[Figure: matrices C, A, and B; each cache line brought in for column j of B spans b elements of a row of B, so it also contains elements of the neighboring columns]
May as well compute c_i,j+1, since one loads column j+1 of B into the cache lines anyway. But one must reorder the operations as follows:
  compute the first b terms of c_ij, compute the first b terms of c_i,j+1
  compute the next b terms of c_ij, compute the next b terms of c_i,j+1
  ...
69
“Blocked” Matrix Multiplication
[Figure: matrices C, A, and B; the cache lines loaded for a group of columns of B can serve a whole sub-row of C]
May as well compute a whole sub-row of C, with the same reordering of the operations. But by computing a whole row of C, one has to load all columns of B, which one has to do again for computing the next row of C.
Idea: reuse the blocks of B that we have just loaded.
70
“Blocked” Matrix Multiplication
[Figure: matrices C, A, and B; a b x b block of C is computed from a block row of A and a block column of B]
Order of the operations:
  Compute the first b terms of all c_ij values in the C block
  Compute the next b terms of all c_ij values in the C block
  ...
  Compute the last b terms of all c_ij values in the C block
71
“Blocked” Matrix Multiplication
The matrices are partitioned into 4 x 4 grids of b x b blocks (N = 4*b):

  C11 C12 C13 C14      A11 A12 A13 A14      B11 B12 B13 B14
  C21 C22 C23 C24      A21 A22 A23 A24      B21 B22 B23 B24
  C31 C32 C33 C34      A31 A32 A33 A34      B31 B32 B33 B34
  C41 C42 C43 C44      A41 A42 A43 A44      B41 B42 B43 B44

For example:
  C22 = A21*B12 + A22*B22 + A23*B32 + A24*B42
  (4 matrix multiplications and 4 matrix additions)

Main point: each multiplication operates on small "block" matrices, whose size may be chosen so that they fit in the cache.
72
Blocked Algorithm
The blocked version of the i-j-k algorithm is written simply as

  for (i=0;i<N/b;i++)
    for (j=0;j<N/b;j++)
      for (k=0;k<N/b;k++)
        C[i][j] += A[i][k]*B[k][j]

where b is the block size (which we assume divides N),
where X[i][j] denotes the block of matrix X on block row i and block column j,
where "+=" means matrix addition, and "*" means matrix multiplication
(see the concrete sketch below)
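A concrete, non-optimized sketch of this blocked algorithm with the per-block loops written out (an illustration, not the official version; it assumes the block size divides N, c has been zeroed, and the function name is made up):

  /* Blocked matrix multiplication, block size b (assumed to divide n).
     (bi, bj, bk) walk over blocks; (i, j, k) walk inside a block.
     Assumes c[][] has already been initialized to zero. */
  void blocked_matmul(int n, int b, double a[n][n], double bm[n][n], double c[n][n]) {
      for (int bi = 0; bi < n; bi += b)
          for (int bj = 0; bj < n; bj += b)
              for (int bk = 0; bk < n; bk += b)
                  /* C block (bi,bj) += A block (bi,bk) * B block (bk,bj) */
                  for (int i = bi; i < bi + b; i++)
                      for (int k = bk; k < bk + b; k++) {
                          double aik = a[i][k];
                          for (int j = bj; j < bj + b; j++)
                              c[i][j] += aik * bm[k][j];
                      }
  }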
73
Cache Misses?

  for (i=0;i<N/b;i++)
    for (j=0;j<N/b;j++)
      // (1) write block C[i][j] to memory
      for (k=0;k<N/b;k++)
        // (2) load block A[i][k] from memory
        // (3) load block B[k][j] from memory
        C[i][j] += A[i][k]*B[k][j]

Counting elements transferred between slow memory and cache:
  (1): (N/b)*(N/b)*b*b
  (2): (N/b)*(N/b)*(N/b)*b*b
  (3): (N/b)*(N/b)*(N/b)*b*b
Total: N² + 2N³/b ≈ 2N³/b
74
Performance?
Slow memory accesses ≈ 2N³/b; number of operations = 2N³
Therefore, ratio ops / mem ≈ b
This ratio should be as high as possible
  (Compare to the value of 2 that we obtained with the non-blocked implementation)
This implies that one should make the block size as large as possible
But if we take this result to the extreme, then the block size should be equal to N!!
  This clearly doesn't make sense, because then we're back to the non-blocked implementation
75
Maximum Block Size
The blocking optimization only works if the blocks fit in the cache
  That is, 3 blocks of size b x b must fit in the cache (for A, B, and C)
Let M be the cache size (in elements)
We must have: 3b² ≤ M, or b ≤ √(M/3)
Therefore, in the best case, the ratio of the number of operations to slow memory accesses is √(M/3)
76
Necessary cache size
Therefore, given a machine with some ratio of arithmetic operation speed to slow memory speed, one can compute the cache size necessary to run blocked matrix multiplication so that the processor never waits for memory:

  Machine       Necessary cache size (KB)
  Ultra 2i      14.8
  Ultra 3        4.7
  Pentium 3      0.9
  Pentium 3M     2.4
  Power3         1.8
  Power4         5.4
  Itanium1      31.1
  Itanium2       0.7
77
TGE Case Study
  C00 = C00 + A00*B00;   C00 = C00 + A01*B10
  C01 = C01 + A00*B01;   C01 = C01 + A01*B11
  C10 = C10 + A10*B00;   C10 = C10 + A11*B10
  C11 = C11 + A10*B01;   C11 = C11 + A11*B11
78
Grid Superscalar
79
TRAP/J
80
Integration
1. We start with the original sequential code
2. Grid Superscalar: stubs and skeletons
3. TRAP/J: delegate and wrapper class
4. Compilation and binding of the delegate and the original application
5. A grid-enabled application is obtained!
81
Back on Our Case Study…
82
Results
83
Strassen's Matrix Multiplication
The traditional algorithm (with or without blocking) takes O(n^3) flops
Strassen discovered an algorithm with asymptotically lower flops: O(n^2.81)
Consider a 2x2 matrix multiply
  Normally it takes 8 multiplies and 4 adds
  Strassen does it with 7 multiplies and 18 adds: fewer multiplies, which is what matters when the "entries" are themselves matrices and the algorithm is applied recursively

Let

  M = | m11 m12 | = | a11 a12 | * | b11 b12 |
      | m21 m22 |   | a21 a22 |   | b21 b22 |

Let
  p1 = (a12 - a22) * (b21 + b22)      p5 = a11 * (b12 - b22)
  p2 = (a11 + a22) * (b11 + b22)      p6 = a22 * (b21 - b11)
  p3 = (a11 - a21) * (b11 + b12)      p7 = (a21 + a22) * b11
  p4 = (a11 + a12) * b22

Then
  m11 = p1 + p2 - p4 + p6
  m12 = p4 + p5
  m21 = p6 + p7
  m22 = p2 - p3 + p5 - p7
Extends to nxn by divide&conquer
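A direct transcription of the 2x2 case (a sketch; in the recursive n x n version each product pN becomes a recursive Strassen multiply of (n/2)x(n/2) blocks, and + / - become matrix addition and subtraction):

  /* Strassen's seven products for a 2x2 multiply M = A * B (scalar entries). */
  void strassen_2x2(const double A[2][2], const double B[2][2], double M[2][2]) {
      double p1 = (A[0][1] - A[1][1]) * (B[1][0] + B[1][1]);
      double p2 = (A[0][0] + A[1][1]) * (B[0][0] + B[1][1]);
      double p3 = (A[0][0] - A[1][0]) * (B[0][0] + B[0][1]);
      double p4 = (A[0][0] + A[0][1]) * B[1][1];
      double p5 = A[0][0] * (B[0][1] - B[1][1]);
      double p6 = A[1][1] * (B[1][0] - B[0][0]);
      double p7 = (A[1][0] + A[1][1]) * B[0][0];
      M[0][0] = p1 + p2 - p4 + p6;
      M[0][1] = p4 + p5;
      M[1][0] = p6 + p7;
      M[1][1] = p2 - p3 + p5 - p7;
  }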
84
Overall Lessons
Truly understanding application/cache behavior is tricky
But approximations can be obtained, with which one can reason about how to improve an implementation
The notion of blocking recurs in many algorithms and applications
You'll write a blocked matrix multiplication in a Programming Assignment
  Not hard, but tricky with indices and loops
85
Automatic Program Generation
It is difficult to optimize code because:
  There are many possible options for tuning/modifying the code
  These options interact in complex ways with the compiler and the hardware
This is really an "optimization problem"
  The objective function is the code's performance
  The feasible solutions are all possible ways to implement the software
    Typically a finite number of implementation decisions are to be made
    Each decision can take a range of values
    e.g., the 7th loop in the 3rd function can be unrolled 1, 2, ..., 20 times
    e.g., the "block size" could be 2x2, 4x4, ..., 400x400
    e.g., a function could be recursive or iterative
And one needs to do it again and again for different platforms
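As a toy illustration of this "optimization problem" view, one could brute-force a few block sizes and keep the fastest. This is a hypothetical sketch (the function names are made up; it assumes a blocked_matmul like the earlier sketch and uses clock() for timing):

  #include <stdio.h>
  #include <time.h>

  void blocked_matmul(int n, int b, double a[n][n], double bm[n][n], double c[n][n]);

  /* Brute-force auto-tuning sketch: try a handful of candidate block sizes
     and keep whichever runs fastest. The contents of c are not reset between
     trials; only the timing matters here. */
  int pick_block_size(int n, double a[n][n], double bm[n][n], double c[n][n]) {
      int candidates[] = { 8, 16, 32, 64, 128 };
      int best_b = 0;                            /* 0 means "none tried yet" */
      double best_t = 1e30;
      for (int i = 0; i < 5; i++) {
          int b = candidates[i];
          if (n % b != 0) continue;              /* keep the "b divides N" assumption */
          clock_t t0 = clock();
          blocked_matmul(n, b, a, bm, c);
          double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
          if (t < best_t) { best_t = t; best_b = b; }
      }
      printf("best block size: %d (%.3f s)\n", best_b, best_t);
      return best_b;
  }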
86
Automatic Program Generation
What is good at solving hard optimization problems? Computers!
Therefore, a computer program could generate the computer program with the best performance
  Could use a brute-force approach: try all possible solutions
    But there is an exponential number of them
  Could use genetic algorithms
  Could use some ad-hoc optimization technique
87
Matrix Multiplication
We have seen that for matrix multiplication there are several possible ways to optimize the code:
  Block size
  Optimization flag to the compiler
  Order of loops
  ...
It is difficult to find the best one
People have written automatic matrix multiplication program generators!
88
The ATLAS Project
ATLAS is a software package that you can download and run on most platforms
It runs for a while (perhaps a couple of hours) and generates a .c file that implements matrix multiplication!
ATLAS optimizes for:
  Instruction cache reuse
  Floating point instruction ordering (pipelined functional units)
  Reducing loop overhead
  Exposing parallelism (multiple functional units)
  Cache reuse
89
ATLAS (500x500 matrices)
ATLAS is faster than all other portable BLAS implementations and it is comparable with machine-specific libraries provided by the vendor.
[Plot: MFLOPS achieved on various architectures by Vendor BLAS, ATLAS BLAS, and reference F77 BLAS]
Source: Jack Dongarra
90
Conclusions
Programming for performance means working with the compiler
  Knowing its limitations
  Knowing its capabilities if it is unhindered
Finding the optimal code is really difficult, but dealing with locality is paramount
Automatic approaches for generating the code have been very successful in some cases