Program’Op*miza*on’and’Analysis’lu/cse467s/slides/program.pdf · Instruc*ons’(ARM7) ... "...
Transcript of Program’Op*miza*on’and’Analysis’lu/cse467s/slides/program.pdf · Instruc*ons’(ARM7) ... "...
Program Transforma*on
Chenyang Lu CSE 467S 2
compile assembly assemble assembly assembly
link load
assembly assembly object
HLL HLL HLL
Rela5ve Address
Absolute Address
Physical Address
executable
op#mize Analyze
What do we need to do?
Ø Understand optimization levels (-O1, -O2, etc.) q http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
Ø Optimize HLL code.
Ø Analyze and optimize assembly code.
Ø Modifying compiler output requires care: q correctness;
q loss of hand-tweaked code.
Chenyang Lu CSE 467S 3
Goals Ø Optimizing for execution time. Ø Optimizing for energy/power.
Ø Optimizing for program size. Ø They may conflict with each other!
Chenyang Lu CSE 467S 4
Expression Simplifica*on
Ø Constant folding: q 8+1 = 9
Ø Algebraic: q a*b + a*c = a*(b+c)
Ø Strength reduction: q a*2 = a<<1
Chenyang Lu CSE 467S 5
Dead Code Elimina*on
Ø Dead code: #define DEBUG 0
if (DEBUG) dbg(p1);
Ø Eliminate by control flow analysis, constant folding.
Chenyang Lu CSE 467S 6
0
dbg(p1);
1
0
Instruc*ons (ARM7)
Ø Branch and link instruction: BL foo == MOV r14, r15 B foo
q r15 contains the current PC q Copies current PC to r14.
Ø To return from subroutine: MOV r15,r14
Chenyang Lu CSE 467S 8
Stack
Ø Use a stack to keep track of q parameters,
q return value,
q return address.
Ø Caller and callee access the stack in a consistent order. q Different compilers/programmers may follow different orders.
Ø Access the stack (ARM7) q r13 always points to the top of stack
q Push: STR r0, [r13, #4]!
q Pop: SUB r13, #4
Chenyang Lu CSE 467S 9
Stack Opera*ons
Ø Caller: call a function q Push parameters to stack q BL (r15 à r14; jump)
Ø Callee: receive a call q Read parameters from stack q Overwrite top of stack with return address (r14)
Ø Callee: return q Load PC with return address (on top of stack)
Ø Caller: receive a return q Pop callee’s return address from stack
Chenyang Lu CSE 467S 10
Nested func*on calls (ARM7) main() { f1(x); } void f1(int a) { f2(a); } ; f1 is called by main() LDR r0, [r13] ; load parameter into r0 from stack STR r14, [r13] ; store f1’s return addr. ; f1 calls f2() STR r0, [r13, #4]! ; push parameter for f2 to stack BL f2 ; branch and link to f2 ; return from f2() SUB r13, #4 ; pop f2’s parameter off stack ; f1 returns to main() LDR r15, [r13] ; restore register and return
Chenyang Lu CSE 467S 11
Func*on Inlining int foo(a,b,c) { return a + b -‐ c;} z = foo(w,x,y);
ð z = w + x -‐ y;
Ø Improve performance by eliminating function call overhead.
Ø May increase code size, but not always…
Ø Affect instruction cache behavior.
Chenyang Lu CSE 467S 12
Op*miza*on: Inlining
Chenyang Lu 13
• Inlining improves performance and reduces code size. • Why?
App Code size Code reduction
Data size
CPU reduction inlined noninlined
Surge 14794 16984 12% 1188 15% Maté 25040 27458 9% 1710 34% TinyDB 64910 71724 10% 2894 30%
Loop Unrolling
Ø Reduces loop overhead
for (i=0; i<4; i++) a[i] = b[i] * c[i]; ð for (i=0; i<4; i+=2) { a[i] = b[i] * c[]; a[i+1] = b[i+1] * c[i+1]; }
Chenyang Lu CSE 467S 15
Loop Overhead on ARM7
; loop initiation code MOV r0, #0 ; use r0 for loop counter MOV r8, #0 ; use separate index for arrays LDR r1, #4 ; buffer size MOV r2, #0 ; use r2 for f ADR r3, c ; load r3 with base of c[ ] ADR r5, x ; load r5 with base of x[ ]
; loop; L: LDR r4, [r3, r8] ; get c[i] LDR r6, [r5, r8] ; get x[i] MUL r4, r4, r6 ; compute c[i]x[i] ADD r2, r2, r4 ; add into sum ADD r8, r8, #4 ; add one word to array index ADD r0, r0, #1 ; add 1 to i CMP r0, r1 ; exit? BLT L ; if i < 4, continue
Chenyang Lu CSE 467S 16
Loop Fusion
Combines multiple loops: for (i=0; i<N; i++) a[i] = b[i] * 5; for (j=0; j<N; j++) w[j] = c[j] * d[j]; ð for (i=0; i<N; i++) { a[i] = b[i] * 5; w[i] = c[i] * d[i];
}
Necessary conditions Ø Loops share a same index
Ø No dependencies between two loops
Chenyang Lu CSE 467S 17
Code Mo*on
for (i=0; i<N*M; i++) z[i] = a[i] + b[i];
Chenyang Lu CSE 467S 18
i<N*M
i=0;
z[i] = a[i] + b[i];
i = i+1;
N
Y
i<X
i=0; X = N*M
One-‐Dimensional Array
Ø C array name points to 0th element:
Chenyang Lu CSE 467S 20
a[0]
a[1]
a[2]
a a[i] = *(a + i)
Two-‐Dimensional Array
Ø Row-major layout:
Chenyang Lu CSE 467S 21
a[0,0]
a[0,1]
a[1,0]
a[1,1] a[i][j] = *(a + i*M + j)
...
M
...
N
Chenyang Lu CSE 467S 22
for (i=0; i<N; i++) for (j=0; j<M; j++) z[i][j] = b[i][j];
zptr = z; bptr = b; for (i=0; i<N; i++)
for (j=0; j<M; j++) { zind = i*M+j; bind = i*M+j; *(zptr+zind)=*(bptr+bind) }
zptr = z; bptr = b; for (i=0; i<N; i++)
for (j=0; j<M; j++) { zbind = i*M+j; *(zptr+zbind)=*(bptr+zbind); }
zptr = z; bptr = b; zbind = 0; for (i=0; i<N; i++)
for (j=0; j<M; j++) { *(zptr+zbind)=*(bptr+zbind);
zbind++; }
induction variable elimination
strength reduction
Cache Analysis
Ø Loops use large quan55es of data (arrays) à cache conflicts
Chenyang Lu CSE 467S 23
Direct-‐Mapped Cache
Chenyang Lu CSE 467S 24
valid
=
tag index offset
hit value
tag data 1 0xabcd byte byte byte ...
byte
cache block
Array Conflicts in Cache
Chenyang Lu CSE 467S 25
a[0,0]
b[0,0]
main memory cache
1024 4099
...
1024
4099
for (i=0; i<N; i++) for (j=0; j<M; j++)
a[i][j] = a[i][j] + b[i][j];
256
Array Conflicts
Ø Array elements conflict because they are in the same line. Ø Solu5on: move one array.
Chenyang Lu CSE 467S 26
Sta*c Cache Locking
Ø Lock instructions in cache before execution. Ø Predictable execution time.
Ø Similarly, lock code and data in memory to avoid paging.
Chenyang Lu CSE 467S 27
Register Alloca*on
Ø Fit current variables in registers.
Ø Load once, use many times. ü Reduce number of cache/memory accesses. ü Improve performance.
ü Reduce energy consumption.
Chenyang Lu CSE 467S 28
Register Life*me Graph
1. w = a + b; 2. x = c + w; 3. y = c + d; 4. z = a - b;
abcdwxyz
1 2 3 4
Chenyang Lu CSE 467S 29
no. of needed register = 5
ATer Rescheduling…
1. w = a + b; 2. z = a - b; 3. x = c + w; 4. y = c + d;
abcdwxyz
1 2 3 4
Chenyang Lu CSE 467S 30
no. of needed register = 4
Cannot change dependencies between instructions!
Summary: Performance Op*miza*on
Ø Use registers efficiently.
Ø Optimize loops.
Ø Optimize function calls.
Ø Optimize cache behavior: q Avoid instruction conflicts by rewriting code, rescheduling; q Move conflicting scalar/array data can be moved.
Chenyang Lu CSE 467S 31
Execu*on Time Analysis
Ø Real-time systems must meet deadlines.
Ø Need to analyze execution time.
Chenyang Lu CSE 467S 32
Execu*on Time
Ø Affected by program path and instruction timing
Ø Program path depends on input data. q Sensor readings q User input
Ø Instruction timing depends on q pipelining q cache behavior – memory can be x10 slower than cache!
Chenyang Lu CSE 467S 33
Program Path
for (i=0, f=0; i<N; i++) f = f + c[i]*x[i];
• Loop initiation executed once.
• Loop test executed N+1 times.
• Loop body and index update executed N times.
Chenyang Lu CSE 467S 34
i<N
i=0; f=0;
f = f + c[i]*x[i];
i = i+1;
N
Y
Execu*on Time Metrics
Ø Difficult to predict execution time accurately. ���
Ø Average case q For “typical” data values q Soft real-time
Ø Worst case q For any possible input set
q Hard real-time
q Longest program path may NOT lead to worst-case execution time
Chenyang Lu CSE 467S 35
Analysis
Ø Analyze optimized assembly/binary code, not high-level language (HLL) code q HLL statement à many assembly/binary instructions
q Example: function calls
Ø Challenges q Program path depends on input data
q Pipelining, cache effects are hard to predict
q Analysis tends to be pessimistic
Chenyang Lu CSE 467S 37
Measurement
Ø CPU simulator q I/O may be hard to measure.
q May not be totally accurate.
Ø Time stamping q Requires instrumenting program.
q Timer granularity • Gettimeofday on Linux: ms • Gethrtime on Intel processors: read 64-bit clock cycle counter and return
the number of clock cycles since CPU was powered up or reset.
Ø Logic analyzer: limited logic analyzer memory depth.
Chenyang Lu CSE 467S 38
Output from a Logic Analyzer
Chenyang Lu CSE 467S 39
Timing diagram of event propagation on Mote Granularity: 50 microsecond
Trace-‐driven Analysis
Ø Record of the program path of a program.
Ø Help study cache behavior and power management policies.
Ø A useful trace q requires proper input values; q is large.
Chenyang Lu CSE 467S 40
Trace Genera*on
Ø Hardware capture q Logic analyzer
• Limited buffer space • Cannot observe on-chip cache
q Hardware assist in CPU • Pentium supports automatic tracing of branches
Ø Software q PC sampling
q Instrumentation instructions q Simulation
Chenyang Lu CSE 467S 41
Goals Ø Optimizing for execution time. Ø Optimizing for energy/power.
Ø Optimizing for program size.
Chenyang Lu CSE 467S 42
Op*mizing for Program Size
Ø Goals q Reduce memory cost;
q Reduce power consumption.
Ø Two opportunities: q Data; q Instructions.
Chenyang Lu CSE 467S 43
Reduce Data Size
Ø Reuse constants, variables, buffers in different parts of code. q Single-buffer in TinyOS.
q Pack multiple flags in one byte.
q Use shortest data type needed.
q Requires careful verification of correctness.
Ø Generate data using instructions.
Chenyang Lu CSE 467S 44
uint8_t i; for(i = 0; i < 1000; i++) { ... } // This loop will never terminate
Reduce Code Size
Ø Avoid loop unrolling.
Ø Inlining? q Size of function
q Number of calls
Ø Choose CPU with compact instructions. q Digital Signal Processors (DSP) tend to have smaller code.
Ø Some CPUs support dense instruction set q ARM Thumb, MIPS-16
Chenyang Lu CSE 467S 45
Code Compression
Ø Use sta5s5cal compression to reduce code size. Ø Decompress on-‐the-‐fly. Ø Need to handle jump addresses.
Chenyang Lu CSE 467S 46
CPU decompressor table
cache
main memory
0101101 0101101
LDR r0,[r4]