CS 3214 Computer Systems
Godmar Back
Lecture 9
Announcements
• Stay tuned for Exercise 5
• Project 2 due Sep 30
• Auto-fail rule 2:
  – Need at least Firecracker to blow up to pass class.
CS 3214 Fall 2010
CODE OPTIMIZATION Part 2
Some of the following slides are taken with permission from Complete Powerpoint Lecture Notes for Computer Systems: A Programmer's Perspective (CS:APP)
Randal E. Bryant and David R. O'Hallaron
http://csapp.cs.cmu.edu/public/lectures.html
Roles of Programmer vs Compiler
• Programmer:
  – Choice of algorithm, Big-O
  – Manual application of some optimizations
  – Choice of program structure that’s amenable to optimization
  – Avoidance of “optimization blockers”
[Figure: optimization spectrum from High-Level to Low-Level, with the Compiler and Programmer roles marked]
Roles of Programmer vs Compiler
• Optimizing Compiler
  – Applies transformations that preserve semantics, but reduce the amount of, or time spent in, computations
  – Provides efficient mapping of code to machine:
    • Selects and orders code
    • Performs register allocation
  – Usually consists of multiple stages
Eliminating Memory Accesses, Take 1
• Registers are faster than memory
double sp1(double *x, double *y)
{
    double sum = *x * *x + *y * *y;
    double diff = *x * *x - *y * *y;
    return sum * diff;
}

sp1:
    movsd  (%rdi), %xmm1
    movsd  (%rsi), %xmm2
    mulsd  %xmm1, %xmm1
    mulsd  %xmm2, %xmm2
    movapd %xmm1, %xmm0
    subsd  %xmm2, %xmm1
    addsd  %xmm2, %xmm0
    mulsd  %xmm1, %xmm0
    ret

How many memory accesses?
Number of memory accesses is not related to how often pointer dereferences occur in source code
Eliminating Memory Accesses, Take 2
• Order of accesses matters
void sp1(double *x, double *y, double *sum, double *prod)
{
    *sum = *x + *y;
    *prod = *x * *y;
}

sp1:
    movsd (%rdi), %xmm0
    addsd (%rsi), %xmm0
    movsd %xmm0, (%rdx)
    movsd (%rdi), %xmm0
    mulsd (%rsi), %xmm0
    movsd %xmm0, (%rcx)
    ret

How many memory accesses?
Eliminating Memory Accesses, Take 3
• Compiler doesn’t know that sum or prod will never point to same location as x or y!

void sp2(double *x, double *y, double *sum, double *prod)
{
    double xlocal = *x;
    double ylocal = *y;

    *sum = xlocal + ylocal;
    *prod = xlocal * ylocal;
}

sp2:
    movsd  (%rdi), %xmm0
    movsd  (%rsi), %xmm2
    movapd %xmm0, %xmm1
    mulsd  %xmm2, %xmm0
    addsd  %xmm2, %xmm1
    movsd  %xmm1, (%rdx)
    movsd  %xmm0, (%rcx)
    ret

How many memory accesses?
Inlining
• Substitute body of called function into the caller
  – *before subsequent optimizations are applied*
• Current compilers do this aggressively
• Almost never a need for doing this manually (e.g., via #define)
Inlining Example
void sp1(double *x, double *y, double *sum, double *prod)
{
    *sum = *x + *y;
    *prod = *x * *y;
}

double outersp1(double *x, double *y)
{
    double sum, prod;

    sp1(x, y, &sum, &prod);
    return sum > prod ? sum : prod;
}

outersp1:
    movsd  (%rdi), %xmm1
    movsd  (%rsi), %xmm2
    movapd %xmm1, %xmm0
    mulsd  %xmm2, %xmm1
    addsd  %xmm2, %xmm0
    maxsd  %xmm1, %xmm0
    ret
Case Study: Vector ADT
• Procedures
  vec_ptr new_vec(int len)
    • Create vector of specified length
  int get_vec_element(vec_ptr v, int index, int *dest)
    • Retrieve vector element, store at *dest
    • Return 0 if out of bounds, 1 if successful
  int *get_vec_start(vec_ptr v)
    • Return pointer to start of vector data
  – Similar to array implementations in Pascal, ML, Java
    • E.g., always do bounds checking
[Figure: vector record with fields length and data; data points to elements indexed 0, 1, 2, ..., length-1]
Optimization Example
• Procedure
  – Compute sum of all elements of vector
  – Store result at destination location

void combine1(vec_ptr v, int *dest)
{
    int i;
    *dest = 0;
    for (i = 0; i < vec_length(v); i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest += val;
    }
}
Time Scales
• Absolute Time
  – Typically use nanoseconds: 10^-9 seconds
  – Time scale of computer instructions
• Clock Cycles
  – Example: rlogin cluster machines: 2 GHz, i.e., 2 × 10^9 cycles per second
  – Clock period = 0.5 ns
  – Most modern architectures provide way to directly read cycle counter: “TSC” – “time stamp counter”
    • But: can be tricky because it captures OS interaction as well
Cycles Per Element
• Convenient way to express performance of program that operates on vectors or lists
• Length = n
• T = CPE*n + Overhead
[Plot: cycles (0 to 1000) vs. elements (0 to 200); vsum1 slope = 4.0, vsum2 slope = 3.5]
Optimization Example
• Procedure
  – Compute sum of all elements of integer vector
  – Store result at destination location
  – Vector data structure and operations defined via abstract data type
• Pentium II/III Performance: Clock Cycles / Element
  – 42.06 (Compiled -g)
  – 31.25 (Compiled -O2)

void combine1(vec_ptr v, int *dest)
{
    int i;
    *dest = 0;
    for (i = 0; i < vec_length(v); i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest += val;
    }
}
Understanding Loop
• Inefficiency
  – Procedure vec_length called every iteration
  – Even though result always the same

void combine1_goto(vec_ptr v, int *dest)
{
    int i = 0;
    int val;
    *dest = 0;
    if (i >= vec_length(v))
        goto done;
loop:                               /* start of 1 iteration */
    get_vec_element(v, i, &val);
    *dest += val;
    i++;
    if (i < vec_length(v))          /* end of 1 iteration */
        goto loop;
done:
    return;
}
Move vec_length Call Out of Loop
• Optimization
  – Move call to vec_length out of inner loop
    • Value does not change from one iteration to next
    • Code motion
  – CPE: 20.66 (Compiled -O2)
    • vec_length requires only constant time, but significant overhead

void combine2(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    *dest = 0;
    for (i = 0; i < length; i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest += val;
    }
}
Code Motion Example #2
• Convert string from upper to lower
• Here: asymptotic complexity becomes O(n^2), because strlen, itself O(n), is called on every iteration!

void lower(char *s)
{
    int i;
    for (i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
Lower Case Conversion Performance
– Time quadruples when string length doubles
– Quadratic performance
[Plot: lower1, CPU seconds (log scale, 0.0001 to 1000) vs. string length (256 to 262144)]
Performance after Code Motion
– Time doubles when string length doubles
– Linear performance
[Plot: lower1 and lower2, CPU seconds (log scale, 0.000001 to 1000) vs. string length (256 to 262144)]
Optimization Blocker: Procedure Calls
• Why couldn’t the compiler move vec_length or strlen out of the inner loop?
  – Procedure may have side effects
    • Alters global state each time called
  – Function may not return same value for given arguments
    • Depends on other parts of global state
    • Procedure lower could interact with strlen
• What if compiler looks at code? Or inlines them?
  – Even then, compiler may not be able to prove that the same result is obtained, or the possibility of aliasing may require repeating the operation; and compiler must preserve any side effects
  – Interprocedural optimization is expensive, but compilers are continuously getting better at it
    • For instance, take into account if a function reads or writes to global memory
  – Today’s compilers are different from the compilers 5 years ago and will be different from those 5 years from now
Remove Bounds Checking
• Optimization
  – Avoid procedure call to retrieve each vector element
    • Get pointer to start of array before loop
    • Within loop just do pointer reference
    • Not as clean in terms of data abstraction
  – CPE: 6.00 (Compiled -O2)
    • Procedure calls are expensive!
    • Bounds checking is expensive

void combine3(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    *dest = 0;
    for (i = 0; i < length; i++) {
        *dest += data[i];
    }
}
Eliminate Unneeded Memory Refs
• Optimization
  – Don’t need to store in destination until end
  – Local variable sum held in register
  – Avoids 1 memory read, 1 memory write per iteration
  – CPE: 2.00 (Compiled -O2)
• Memory references are expensive!

void combine4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int sum = 0;
    for (i = 0; i < length; i++)
        sum += data[i];
    *dest = sum;
}
Detecting Unneeded Memory Refs.
• Performance
  – Combine3
    • 5 instructions in 6 clock cycles
    • addl must read and write memory
  – Combine4
    • 4 instructions in 2 clock cycles

Combine3
.L18:
    movl (%ecx,%edx,4),%eax
    addl %eax,(%edi)
    incl %edx
    cmpl %esi,%edx
    jl .L18

Combine4
.L24:
    addl (%eax,%edx,4),%ecx
    incl %edx
    cmpl %esi,%edx
    jl .L24
Pointer Code
• Optimization
  – Use pointers rather than array references
  – CPE: 3.00 (Compiled -O2)
    • Oops! Worse than the best array version
Warning: Some compilers do a better job optimizing array code

void combine4p(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int *dend = data + length;
    int sum = 0;
    while (data < dend) {
        sum += *data;
        data++;
    }
    *dest = sum;
}

Big question: Should you rewrite your array code as pointer code to “help” the compiler?
Pointer vs. Array Code Inner Loops
• Array Code

.L24:                           # Loop:
    addl (%eax,%edx,4),%ecx     # sum += data[i]
    incl %edx                   # i++
    cmpl %esi,%edx              # i:length
    jl .L24                     # if < goto Loop

• Pointer Code

.L30:                           # Loop:
    addl (%eax),%ecx            # sum += *data
    addl $4,%eax                # data++
    cmpl %edx,%eax              # data:dend
    jb .L30                     # if < goto Loop

• Performance
  – Array Code: 4 instructions in 2 clock cycles
  – Pointer Code: Almost same 4 instructions in 3 clock cycles
Pointer vs. Array Code
• Difficult to predict which would be faster
• Compiler may transform array to pointer form if it deems it useful
• Compiler as a rule optimizes array code as well as or better than it does pointer code
• Writing as array code allows use of index variable in index-based address modes
• Should prefer array form for readability
Lessons so far (1)
• Does not matter how many local variables or temporaries you introduce
• Does not matter if you use constants, expressions, or const local variables, or write-once local variables
  – So optimize for readability, not the compiler
• Does not matter how many pointer derefs you have in your code (*, [ ], ->) as long as there’s no intervening write/store to memory
  – If there is, compiler must repeat the ‘load’
  – Avoid introducing ‘stores’ by introducing local temporaries that defer the write to memory whenever possible
• Don’t rewrite array code into pointer form
Lessons so far (2)
• Inlining changes the game substantially
  – Compiler will aggressively inline functions whose definitions occur in same compilation unit
  – Does not matter if declared ‘static’ or not; but must be static if included in multiple files to avoid multiple strong symbols
• Can remove abstraction penalty entirely in many cases
  – No need for manual inlining, using macros
• Inlining can generate better code because it enables optimizations not possible without knowing the caller:
  – Potential for aliasing of pointer arguments may be reduced, allowing for more precise and less-conservative points-to analysis
  – May even be able to remove bounds checks (next slide)
• Caveat: inlining is not possible if target of the call is not known to the compiler
  – E.g. non-final, non-private methods in Java, or “virtual” methods in C++; so declare your methods final or private in Java whenever possible
combine1 Example under inlining
• Procedure
  – Compute sum of all elements of vector
  – Store result at destination location

void combine1(vec_ptr v, int *dest)
{
    int i;
    *dest = 0;
    for (i = 0; i < vec_length(v); i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest += val;
    }
}
/*
 * Retrieve vector element and store at dest.
 * Return 0 (out of bounds) or 1 (successful)
 */
int get_vec_element(vec_ptr v, int index, data_t *dest)
{
    if (index < 0 || index >= v->len)
        return 0;
    *dest = v->data[index];
    return 1;
}

/* Return length of vector */
int vec_length(vec_ptr v)
{
    return v->len;
}
Form after inlining
void combine1(vec_ptr v, int *dest)
{
    int i;
    *dest = 0;

    for (i = 0; i < v->len; i++) {
        int val;
        int ret;                        /* return value of inlined call */
        if (i < 0 || i >= v->len) {     /* becomes redundant! */
            ret = 0;
            goto skip;
        }
        val = v->data[i];               /* index parameter replaced by i */
        ret = 1;
    skip:
        /* caller ignored return value */
        *dest += val;
    }
}

combine1:
    pushl %ebp
    movl %esp, %ebp
    movl 12(%ebp), %ecx
    pushl %esi
    movl 8(%ebp), %esi
    pushl %ebx
    movl $0, (%ecx)
    movl (%esi), %eax
    testl %eax, %eax
    jle .L375
    movl 4(%esi), %ebx
    xorl %edx, %edx
    .p2align 4,,7
.L374:
    movl (%ebx,%edx,4), %eax
    addl $1, %edx
    addl %eax, (%ecx)
    cmpl %edx, (%esi)
    jg .L374
.L375:
    popl %ebx
    popl %esi
    popl %ebp
    ret