CS 3214 Computer Systems

Godmar Back

Lecture 9

Announcements

• Stay tuned for Exercise 5
• Project 2 due Sep 30
• Auto-fail rule 2:
– Need at least Firecracker to blow up to pass class.

CS 3214 Fall 2010

CODE OPTIMIZATION Part 2


Some of the following slides are taken with permission from Complete Powerpoint Lecture Notes for Computer Systems: A Programmer's Perspective (CS:APP)

Randal E. Bryant and David R. O'Hallaron

http://csapp.cs.cmu.edu/public/lectures.html

http://www.cs.cmu.edu/~bryant

http://www.cs.cmu.edu/~droh

Roles of Programmer vs Compiler

• Programmer:
– Choice of algorithm, Big-O
– Manual application of some optimizations
– Choice of program structure that’s amenable to optimization
– Avoidance of “optimization blockers”

[Figure: scale from High-Level to Low-Level, with labels “Programmer” and “Compiler”]

Roles of Programmer vs Compiler

• Optimizing Compiler
– Applies transformations that preserve semantics, but reduce the amount of computation or the time spent in it
– Provides efficient mapping of code to machine:
• Selects and orders code
• Performs register allocation
– Usually consists of multiple stages

Eliminating Memory Accesses, Take 1

• Registers are faster than memory

double sp1(double *x, double *y)
{
    double sum = *x * *x + *y * *y;
    double diff = *x * *x - *y * *y;
    return sum * diff;
}

sp1:
    movsd   (%rdi), %xmm1
    movsd   (%rsi), %xmm2
    mulsd   %xmm1, %xmm1
    mulsd   %xmm2, %xmm2
    movapd  %xmm1, %xmm0
    subsd   %xmm2, %xmm1
    addsd   %xmm2, %xmm0
    mulsd   %xmm1, %xmm0
    ret

How many memory accesses? (Two: the assembly loads *x and *y once each.)

Number of memory accesses is not related to how often pointer dereferences occur in source code


• Order of accesses matters

void sp1(double *x, double *y, double *sum, double *prod)
{
    *sum = *x + *y;
    *prod = *x * *y;
}

sp1:
    movsd   (%rdi), %xmm0
    addsd   (%rsi), %xmm0
    movsd   %xmm0, (%rdx)
    movsd   (%rdi), %xmm0
    mulsd   (%rsi), %xmm0
    movsd   %xmm0, (%rcx)
    ret


• Compiler doesn’t know that sum or prod will never point to same location as x or y!

void sp2(double *x, double *y, double *sum, double *prod)
{
    double xlocal = *x;
    double ylocal = *y;

    *sum = xlocal + ylocal;
    *prod = xlocal * ylocal;
}

sp2:
    movsd   (%rdi), %xmm0
    movsd   (%rsi), %xmm2
    movapd  %xmm0, %xmm1
    mulsd   %xmm2, %xmm0
    addsd   %xmm2, %xmm1
    movsd   %xmm1, (%rdx)
    movsd   %xmm0, (%rcx)
    ret
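The aliasing concern can be made concrete with a small sketch. It repeats the two slide functions and adds two helpers, aliased_prod_sp1 and aliased_prod_sp2, which are hypothetical names introduced only for illustration. Each helper passes the same address for x and sum, which is exactly the case the compiler has to assume is possible.

```c
/* Reloads *x and *y after the store to *sum, as in the first version. */
void sp1(double *x, double *y, double *sum, double *prod)
{
    *sum = *x + *y;
    *prod = *x * *y;
}

/* Reads *x and *y once into locals before any store. */
void sp2(double *x, double *y, double *sum, double *prod)
{
    double xlocal = *x;
    double ylocal = *y;

    *sum = xlocal + ylocal;
    *prod = xlocal * ylocal;
}

/* Hypothetical helper: calls sp1 with sum aliased to x, returns prod. */
double aliased_prod_sp1(double a, double b)
{
    double x = a, y = b, prod;
    sp1(&x, &y, &x, &prod);   /* the store to *sum overwrites x */
    return prod;              /* computed from the overwritten x */
}

/* Hypothetical helper: same call pattern, but using sp2. */
double aliased_prod_sp2(double a, double b)
{
    double x = a, y = b, prod;
    sp2(&x, &y, &x, &prod);   /* x was already copied to xlocal */
    return prod;              /* computed from the original x */
}
```

With a = 2 and b = 3, the sp1 variant multiplies the overwritten x (5) by 3, while the sp2 variant multiplies the original 2 by 3. Because the two functions differ on aliased inputs, the compiler cannot turn sp1 into sp2 on its own.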



Inlining

• Substitute body of called function into the caller
– *before subsequent optimizations are applied*
• Current compilers do this aggressively
• Almost never a need for doing this manually (e.g., via #define)

Inlining Example

void sp1(double *x, double *y, double *sum, double *prod)
{
    *sum = *x + *y;
    *prod = *x * *y;
}

double outersp1(double *x, double *y)
{
    double sum, prod;

    sp1(x, y, &sum, &prod);
    return sum > prod ? sum : prod;
}

outersp1:
    movsd   (%rdi), %xmm1
    movsd   (%rsi), %xmm2
    movapd  %xmm1, %xmm0
    mulsd   %xmm2, %xmm1
    addsd   %xmm2, %xmm0
    maxsd   %xmm1, %xmm0
    ret

Case Study: Vector ADT

• Procedures
vec_ptr new_vec(int len)
• Create vector of specified length
int get_vec_element(vec_ptr v, int index, int *dest)
• Retrieve vector element, store at *dest
• Return 0 if out of bounds, 1 if successful
int *get_vec_start(vec_ptr v)
• Return pointer to start of vector data
– Similar to array implementations in Pascal, ML, Java
• E.g., always do bounds checking

[Figure: vector record with fields length and data; data refers to elements indexed 0, 1, 2, …, length–1]
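A minimal sketch of how such a vector ADT might be implemented, assuming int elements and a record with fields len and data (the field names match the get_vec_element code shown later in these slides). The vec_rec type name, the allocation details, and the free_vec helper are assumptions added for illustration.

```c
#include <stdlib.h>

/* Vector record: element count plus pointer to the element storage. */
typedef struct {
    int len;
    int *data;
} vec_rec, *vec_ptr;

/* Create vector of specified length, zero-initialized; NULL on failure. */
vec_ptr new_vec(int len)
{
    if (len < 0)
        return NULL;
    vec_ptr v = malloc(sizeof(vec_rec));
    if (v == NULL)
        return NULL;
    v->len = len;
    v->data = NULL;
    if (len > 0) {
        v->data = calloc((size_t)len, sizeof(int));
        if (v->data == NULL) {
            free(v);
            return NULL;
        }
    }
    return v;
}

/* Retrieve element and store at *dest; 0 if out of bounds, 1 if ok. */
int get_vec_element(vec_ptr v, int index, int *dest)
{
    if (index < 0 || index >= v->len)
        return 0;
    *dest = v->data[index];
    return 1;
}

/* Return length of vector. */
int vec_length(vec_ptr v)
{
    return v->len;
}

/* Return pointer to start of vector data (bypasses bounds checking). */
int *get_vec_start(vec_ptr v)
{
    return v->data;
}

/* Hypothetical cleanup helper, not part of the interface on the slide. */
void free_vec(vec_ptr v)
{
    if (v != NULL) {
        free(v->data);
        free(v);
    }
}
```

Every access through get_vec_element pays for a call and a bounds check, which is the cost the following slides work to remove.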


Optimization Example

• Procedure
– Compute sum of all elements of vector
– Store result at destination location

void combine1(vec_ptr v, int *dest)
{
    int i;
    *dest = 0;
    for (i = 0; i < vec_length(v); i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest += val;
    }
}

Time Scales

• Absolute Time
– Typically use nanoseconds: 10^-9 seconds
– Time scale of computer instructions
• Clock Cycles
– Example: rlogin cluster machines: 2 GHz, i.e., 2 × 10^9 cycles per second
– Clock period = 0.5 ns
– Most modern architectures provide way to directly read cycle counter: “TSC” – “time stamp counter”
• But: can be tricky because it captures OS interaction as well
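The relationship between absolute time and clock cycles can be sketched in a few lines of C. The function names here are hypothetical. The timing helper uses the standard clock() function and not the TSC, so it measures CPU time at coarse resolution and serves only as a portable stand-in.

```c
#include <time.h>

/* Clock period in nanoseconds for a clock rate given in GHz. */
double clock_period_ns(double ghz)
{
    return 1.0 / ghz;
}

/* Convert a duration in nanoseconds to clock cycles. */
double ns_to_cycles(double ns, double ghz)
{
    return ns * ghz;   /* 1 GHz is one cycle per nanosecond */
}

/* CPU time consumed by fn(arg), in nanoseconds, via standard clock().
 * Resolution is coarse, so fn should run long enough to be measurable. */
double cpu_time_ns(void (*fn)(void *), void *arg)
{
    clock_t start = clock();
    fn(arg);
    clock_t end = clock();
    return (double)(end - start) * 1e9 / CLOCKS_PER_SEC;
}
```

For the 2 GHz machines mentioned above, clock_period_ns(2.0) yields 0.5, and a 10 ns interval corresponds to 20 cycles.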

Cycles Per Element

• Convenient way to express performance of program that operates on vectors or lists
• Length = n
• T = CPE*n + Overhead

[Figure: cycles (0 to 1000) versus number of elements (0 to 200); vsum1 has slope 4.0, vsum2 has slope 3.5]
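One way to estimate CPE from measurements is a least-squares line fit of cycle counts against vector lengths: the slope is the CPE and the intercept is the overhead. The sketch below assumes the measurements are already available as arrays. The function name fit_cpe is hypothetical, and the slides do not say how their slopes were obtained.

```c
#include <stddef.h>

/* Least-squares fit of cycles = cpe * n + overhead.
 * Returns 1 on success, 0 if the fit is undefined
 * (fewer than two points, or all n values identical). */
int fit_cpe(const double *n, const double *cycles, size_t count,
            double *cpe, double *overhead)
{
    double sn = 0.0, sc = 0.0, snn = 0.0, snc = 0.0;
    size_t i;

    if (count < 2)
        return 0;
    for (i = 0; i < count; i++) {
        sn  += n[i];
        sc  += cycles[i];
        snn += n[i] * n[i];
        snc += n[i] * cycles[i];
    }
    double denom = (double)count * snn - sn * sn;
    if (denom == 0.0)
        return 0;
    *cpe = ((double)count * snc - sn * sc) / denom;
    *overhead = (sc - *cpe * sn) / (double)count;
    return 1;
}
```

Given made-up points that lie exactly on T = 4.0*n + 80, the fit returns a CPE of 4.0 and an overhead of 80 cycles. Real measurements are noisy, so the fit is only an estimate.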


Optimization Example

• Procedure
– Compute sum of all elements of integer vector
– Store result at destination location
– Vector data structure and operations defined via abstract data type
• Pentium II/III Performance: Clock Cycles / Element
– 42.06 (Compiled -g)
– 31.25 (Compiled -O2)

Understanding Loop

• Inefficiency
– Procedure vec_length called every iteration
– Even though result always the same

void combine1_goto(vec_ptr v, int *dest)
{
    int i = 0;
    int val;
    *dest = 0;
    if (i >= vec_length(v))
        goto done;
  loop:                            /* 1 iteration: from here ... */
    get_vec_element(v, i, &val);
    *dest += val;
    i++;
    if (i < vec_length(v))         /* ... through this test */
        goto loop;
  done:
    return;
}

Move vec_length Call Out of Loop

• Optimization
– Move call to vec_length out of inner loop
• Value does not change from one iteration to next
• Code motion
– CPE: 20.66 (Compiled -O2)
• vec_length requires only constant time, but significant overhead

void combine2(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    *dest = 0;
    for (i = 0; i < length; i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest += val;
    }
}

Code Motion Example #2

• Convert string from upper to lower
• Here: asymptotic complexity becomes O(n^2)!

void lower(char *s)
{
    int i;
    for (i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

Lower Case Conversion Performance

– Time quadruples when string length doubles
– Quadratic performance

[Figure: CPU seconds (0.0001 to 1000) versus string length (256 to 262144) for lower1]

Performance after Code Motion

– Time doubles when string length doubles
– Linear performance

[Figure: CPU seconds (0.000001 to 1000) versus string length (256 to 262144) for lower1 and lower2]
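The plot legend names lower1 and lower2, but the code for the second version is not shown. The sketch below is an assumption about what it looks like: the strlen call is hoisted out of the loop, analogous to the vec_length change in combine2. The index type is size_t here, not int as on the earlier slide, to match the return type of strlen.

```c
#include <string.h>

/* Quadratic version: strlen(s) is re-evaluated on every iteration. */
void lower1(char *s)
{
    size_t i;
    for (i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

/* Linear version (assumed form): strlen(s) is computed once. */
void lower2(char *s)
{
    size_t i;
    size_t len = strlen(s);   /* code motion: hoisted out of the loop */
    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
```

Both functions produce the same output. Hoisting is safe here because the loop never writes a terminating zero byte, so the length cannot change during the loop.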


Optimization Blocker: Procedure Calls

• Why couldn’t the compiler move vec_length or strlen out of the inner loop?
– Procedure may have side effects
• Alters global state each time called
– Function may not return same value for given arguments
• Depends on other parts of global state
• Procedure lower could interact with strlen
• What if compiler looks at code? Or inlines them?
– Even then, compiler may not be able to prove that the same result is obtained, or the possibility of aliasing may require repeating the operation; and compiler must preserve any side effects
– Interprocedural optimization is expensive, but compilers are continuously getting better at it
• For instance, take into account if a function reads or writes to global memory
– Today’s compilers are different from the compilers 5 years ago and will be different from those 5 years from now
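A small sketch of the side-effect problem. All names are hypothetical: f alters global state on every call, so collapsing four calls into one changes the result.

```c
/* Global state modified by every call to f. */
static long counter = 0;

/* Returns the current counter value, then increments it (side effect). */
long f(void)
{
    return counter++;
}

/* Four separate calls: each one observes a different counter value. */
long func1(void)
{
    return f() + f() + f() + f();
}

/* Looks like an equivalent "optimized" form, but calls f only once. */
long func2(void)
{
    return 4 * f();
}

/* Helper so both variants can be run from the same starting state. */
void reset_counter(void)
{
    counter = 0;
}
```

Starting from counter = 0, func1 returns 0 + 1 + 2 + 3 = 6 and func2 returns 4 * 0 = 0. A compiler that cannot see into f has to assume the calls may behave like this and keep all of them.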


Remove Bounds Checking

• Optimization
– Avoid procedure call to retrieve each vector element
• Get pointer to start of array before loop
• Within loop just do pointer reference
• Not as clean in terms of data abstraction
– CPE: 6.00 (Compiled -O2)
• Procedure calls are expensive!
• Bounds checking is expensive

void combine3(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    *dest = 0;
    for (i = 0; i < length; i++) {
        *dest += data[i];
    }
}

Eliminate Unneeded Memory Refs

• Optimization
– Don’t need to store in destination until end
– Local variable sum held in register
– Avoids 1 memory read, 1 memory write per iteration
– CPE: 2.00 (Compiled -O2)
• Memory references are expensive!

void combine4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int sum = 0;
    for (i = 0; i < length; i++)
        sum += data[i];
    *dest = sum;
}
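Accumulating in a local is a transformation the compiler cannot safely make for you when dest might point into the vector itself. The sketch below uses plain arrays instead of the vector ADT to stay self-contained. The names sum_to_memory and sum_in_register are hypothetical stand-ins for the combine3 and combine4 loops.

```c
/* combine3-style loop: accumulates directly in *dest. */
void sum_to_memory(int *data, int length, int *dest)
{
    int i;
    *dest = 0;
    for (i = 0; i < length; i++)
        *dest += data[i];
}

/* combine4-style loop: accumulates in a local, stores once at the end. */
void sum_in_register(int *data, int length, int *dest)
{
    int i;
    int sum = 0;
    for (i = 0; i < length; i++)
        sum += data[i];
    *dest = sum;
}
```

For the array {2, 3, 5} with dest pointing at element 0, sum_to_memory first zeroes that element and ends with 0 + 0 + 3 + 5 = 8, whereas sum_in_register ends with 2 + 3 + 5 = 10. Since the two differ under aliasing, the programmer has to introduce the local variable explicitly.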


Detecting Unneeded Memory Refs.

• Performance
– Combine3
• 5 instructions in 6 clock cycles
• addl must read and write memory
– Combine4
• 4 instructions in 2 clock cycles

Combine3

.L18:
    movl (%ecx,%edx,4),%eax
    addl %eax,(%edi)
    incl %edx
    cmpl %esi,%edx
    jl .L18

Combine4

.L24:
    addl (%eax,%edx,4),%ecx
    incl %edx
    cmpl %esi,%edx
    jl .L24

Pointer Code

• Optimization
– Use pointers rather than array references
– CPE: 3.00 (Compiled -O2)
• Oops! Worse than the best array version

Warning: Some compilers do a better job optimizing array code

void combine4p(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int *dend = data + length;
    int sum = 0;
    while (data < dend) {
        sum += *data;
        data++;
    }
    *dest = sum;
}

Big question: Should you rewrite your array code as pointer code to “help” the compiler?

Pointer vs. Array Code Inner Loops

• Array Code

.L24:                             # Loop:
    addl (%eax,%edx,4),%ecx       # sum += data[i]
    incl %edx                     # i++
    cmpl %esi,%edx                # i:length
    jl .L24                       # if < goto Loop

• Pointer Code

.L30:                             # Loop:
    addl (%eax),%ecx              # sum += *data
    addl $4,%eax                  # data++
    cmpl %edx,%eax                # data:dend
    jb .L30                       # if < goto Loop

• Performance
– Array Code: 4 instructions in 2 clock cycles
– Pointer Code: Almost same 4 instructions in 3 clock cycles

Pointer vs. Array Code

• Difficult to predict which would be faster
• Compiler may transform array to pointer form if it deems it useful
• Compiler as a rule optimizes array code as well as or better than it does pointer code
• Writing as array code allows use of index variable in index-based address modes
• Should prefer array form for readability

Lessons so far (1)

• Does not matter how many local variables or temporaries you introduce
• Does not matter if you use constants, expressions, or const local variables, or write-once local variables
– So optimize for readability, not the compiler
• Does not matter how many pointer derefs you have in your code (*, [ ], ->) as long as there’s no intervening write/store to memory
– If there is, compiler must repeat the ‘load’
– Avoid introducing ‘stores’ by introducing local temporaries that defer the write to memory whenever possible
• Don’t rewrite array code into pointer form

Lessons so far (2)

• Inlining changes the game substantially
– Compiler will aggressively inline functions whose definitions occur in same compilation unit
– Does not matter if declared ‘static’ or not; but must be static if included in multiple files to avoid multiple strong symbols
• Can remove abstraction penalty entirely in many cases
– No need for manual inlining, using macros
• Inlining can generate better code because it enables optimizations not possible without knowing the caller:
– Potential for aliasing of pointer arguments may be reduced, allowing for more precise and less-conservative points-to analysis
– May even be able to remove bounds checks (next slide)
• Caveat: inlining is not possible if target of the call is not known to the compiler
– E.g. non-final, non-private methods in Java, or “virtual” methods in C++; so declare your methods final or private in Java whenever possible

combine1 Example under inlining

• Procedure
– Compute sum of all elements of vector
– Store result at destination location

/*
 * Retrieve vector element and store at dest.
 * Return 0 (out of bounds) or 1 (successful)
 */
int get_vec_element(vec_ptr v, int index, data_t *dest)
{
    if (index < 0 || index >= v->len)
        return 0;
    *dest = v->data[index];
    return 1;
}

/* Return length of vector */
int vec_length(vec_ptr v)
{
    return v->len;
}

Form after inlining


void combine1(vec_ptr v, int *dest)
{
    int i;
    *dest = 0;

    for (i = 0; i < v->len; i++) {
        int val;
        int ret;                          /* return value of the inlined call */
        if (i < 0 || i >= v->len) {       // becomes redundant!
            ret = 0;
            goto skip;
        }

        val = v->data[i];
        ret = 1;
      skip:
        /* caller ignored return value */
        *dest += val;
    }
}

combine1:
    pushl   %ebp
    movl    %esp, %ebp
    movl    12(%ebp), %ecx
    pushl   %esi
    movl    8(%ebp), %esi
    pushl   %ebx
    movl    $0, (%ecx)
    movl    (%esi), %eax
    testl   %eax, %eax
    jle     .L375
    movl    4(%esi), %ebx
    xorl    %edx, %edx
    .p2align 4,,7
.L374:
    movl    (%ebx,%edx,4), %eax
    addl    $1, %edx
    addl    %eax, (%ecx)
    cmpl    %edx, (%esi)
    jg      .L374
.L375:
    popl    %ebx
    popl    %esi
    popl    %ebp
    ret