Binary Translation Using Peephole Superoptimizers
![Page 1: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/1.jpg)
Binary Translation Using Peephole Superoptimizers
Sorav Bansal, Alex Aiken
Stanford University
![Page 2: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/2.jpg)
Binary Translation
• Allow code written for one ISA to run on another
• Applications
– Portability (e.g., running legacy software)
– Virtualization
– Backward and Forward Compatibility
– On-chip binary translation
– Java Virtual Machines
![Page 3: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/3.jpg)
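Binary Translation
[Figure: software stacks on x86 hardware showing where the binary translator can sit. In one stack a hypervisor hosts an x86 OS with x86 apps and, through the binary translator, a powerpc OS with a powerpc app. In the other stacks a single OS runs x86 apps alongside a binary translator that runs a powerpc app.]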
![Page 4: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/4.jpg)
Binary Translation Wish-list
Performance
Large Complex ISAs
Retargetability
OS Compatibility
![Page 5: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/5.jpg)
Talk Outline
• Superoptimization
• Peephole Superoptimization
• Application to Binary Translation
• Implementation & Experimental Results
• Conclusion
![Page 6: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/6.jpg)
Superoptimization
• A superoptimizer is a code generator that uses brute-force search to try to find the optimal code for a given function
E.g.
int signum(int x) {
    if (x > 0) return 1;
    if (x < 0) return -1;
    else return 0;
}
On Motorola 68020:
add.l d0, d0
subx.l d1, d1
negx.l d0
addx.l d1, d1
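To see why the four-instruction 68020 sequence computes signum without a branch, the following sketch emulates it in Python. It is a simplified model, not a full emulator: it assumes 32-bit data registers and tracks only the extend (X) flag, and the function name `signum_68020` is made up for this illustration.

```python
MASK = 0xFFFFFFFF  # 32-bit data registers

def signum_68020(x):
    """Emulate add.l d0,d0; subx.l d1,d1; negx.l d0; addx.l d1,d1 with x in d0."""
    d0 = x & MASK
    d1 = 0  # initial value does not matter: subx.l d1,d1 computes d1 - d1 - X

    # add.l d0,d0: d0 = d0 + d0, X = carry out (the sign bit of x)
    total = d0 + d0
    d0, xflag = total & MASK, total >> 32

    # subx.l d1,d1: d1 = d1 - d1 - X, X = borrow (d1 becomes 0 or -1)
    diff = d1 - d1 - xflag
    d1, xflag = diff & MASK, int(diff < 0)

    # negx.l d0: d0 = 0 - d0 - X, X = borrow (set whenever x was non-zero)
    diff = 0 - d0 - xflag
    d0, xflag = diff & MASK, int(diff < 0)

    # addx.l d1,d1: d1 = d1 + d1 + X
    d1 = (d1 + d1 + xflag) & MASK

    # Reinterpret the 32-bit result in d1 as a signed integer
    return d1 - (1 << 32) if d1 & 0x80000000 else d1

print([signum_68020(v) for v in (-7, 0, 7)])  # [-1, 0, 1]
```

The result ends up in d1: the first two instructions turn the sign bit into 0 or -1, and the last two add back 1 whenever x is non-zero.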
![Page 7: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/7.jpg)
Superoptimization
• Enumerate all sequences up to a certain length
and
• Compare each enumerated sequence with target function for equivalence
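The enumerate-and-compare loop can be sketched on a toy machine. Everything here is an assumption made for illustration: an 8-bit machine with a single register, five made-up opcodes, and equivalence checked by running all 256 inputs. A real superoptimizer works on a real ISA and needs a cheaper equivalence check.

```python
from itertools import product

WIDTH_MASK = 0xFF  # toy 8-bit machine with one register

# Hypothetical instruction set: each opcode maps the register value to a new value.
OPS = {
    "inc": lambda x: (x + 1) & WIDTH_MASK,
    "dec": lambda x: (x - 1) & WIDTH_MASK,
    "shl": lambda x: (x << 1) & WIDTH_MASK,
    "neg": lambda x: (-x) & WIDTH_MASK,
    "not": lambda x: (~x) & WIDTH_MASK,
}

def run(seq, x):
    """Execute a sequence of opcodes on input x."""
    for op in seq:
        x = OPS[op](x)
    return x

def superoptimize(target, max_len):
    """Return the first shortest sequence equal to target on every input, or None."""
    for length in range(max_len + 1):              # shortest sequences first
        for seq in product(OPS, repeat=length):    # enumerate all sequences of this length
            if all(run(seq, x) == target(x) for x in range(WIDTH_MASK + 1)):
                return seq
    return None

best = superoptimize(lambda x: (2 * x + 2) & WIDTH_MASK, max_len=3)
print(best)  # ('inc', 'shl')
```

The search space grows exponentially with sequence length, which is why the enumeration is bounded by `max_len`.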
![Page 8: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/8.jpg)
Talk Outline
• Superoptimization
• Peephole Superoptimization
• Application to Binary Translation
• Implementation & Experimental Results
• Conclusion
![Page 9: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/9.jpg)
Peephole Superoptimization
Use a superoptimizer to automatically infer peephole optimizations
[S. Bansal, A. Aiken. Automatic Generation of Peephole Superoptimizers, ASPLOS 2006]

Table of Peephole Optimizations

| pattern | replace-with |
| --- | --- |
| add $1, reg | inc reg |
| mul $2, reg | shl reg |
| … | … |
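A minimal sketch of how such a table could be applied to a code stream follows. The tuple encoding of instructions, the `reg` wildcard and the function names are assumptions made for this example, not the representation used in the cited work.

```python
# Each rule: (pattern, replacement); "reg" is a wildcard that matches any one register.
RULES = [
    ((("add", "$1", "reg"),), (("inc", "reg"),)),
    ((("mul", "$2", "reg"),), (("shl", "reg"),)),
]

def match(pattern, window):
    """Return the wildcard binding if window matches pattern, else None."""
    binding = {}
    for pat, ins in zip(pattern, window):
        if len(pat) != len(ins):
            return None
        for p, a in zip(pat, ins):
            if p == "reg":
                # The wildcard must bind a register, and the same one everywhere.
                if not a.startswith("%") or binding.setdefault("reg", a) != a:
                    return None
            elif p != a:
                return None
    return binding

def peephole(code):
    """Scan the code once, replacing every window that matches a rule."""
    out, i = [], 0
    while i < len(code):
        for pattern, replacement in RULES:
            window = code[i:i + len(pattern)]
            binding = match(pattern, window) if len(window) == len(pattern) else None
            if binding is not None:
                out.extend(tuple(binding.get(f, f) for f in ins) for ins in replacement)
                i += len(pattern)
                break
        else:
            out.append(code[i])  # no rule applies: keep the instruction
            i += 1
    return out

code = [("add", "$1", "%eax"), ("mov", "%eax", "%ebx"), ("mul", "$2", "%ecx")]
print(peephole(code))
# [('inc', '%eax'), ('mov', '%eax', '%ebx'), ('shl', '%ecx')]
```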
![Page 10: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/10.jpg)
Peephole Superoptimization
Step 1
Harvest instruction sequences from a binary (a.out) that can potentially be optimized. Canonicalize and store them.

Target Sequences
mov %eax, %ecx; mov %ecx, %eax
sub $123, %eax; add $456, %eax
movl (%eax), %ecx; inc %ecx; movl %ecx, (%eax)
…
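One plausible reading of "harvest and canonicalize" is sketched below: slide a window over the disassembled code and rename registers and constants by order of first appearance, so that sequences differing only in names collapse into one target. The renaming scheme, the textual instruction format and the function names are assumptions for this illustration, and a real canonicalizer has to be more careful about what constants mean.

```python
import re
from collections import Counter

def canonicalize(seq):
    """Rename registers (%eax -> r0, ...) and constants ($123 -> c0, ...) by first appearance."""
    names = {}

    def rename(m):
        tok = m.group(0)
        if tok not in names:
            prefix = "r" if tok.startswith("%") else "c"
            index = sum(1 for v in names.values() if v.startswith(prefix))
            names[tok] = f"{prefix}{index}"
        return names[tok]

    return tuple(re.sub(r"%\w+|\$\w+", rename, ins) for ins in seq)

def harvest(code, max_len=3):
    """Count every canonicalized window of up to max_len consecutive instructions."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(code) - n + 1):
            counts[canonicalize(code[i:i + n])] += 1
    return counts

code = ["mov %eax, %ecx", "mov %ecx, %eax", "mov %ebx, %edx", "mov %edx, %ebx"]
print(canonicalize(code[0:2]))  # ('mov r0, r1', 'mov r1, r0')
print(canonicalize(code[2:4]))  # ('mov r0, r1', 'mov r1, r0')
```

Counting occurrences, as `harvest` does, is one way to prioritize which target sequences are worth the brute-force search.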
![Page 11: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/11.jpg)
Peephole Superoptimization
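Step 2
Brute-force optimization turns each target sequence into an optimized sequence.

| Target Sequences | Optimized Sequences |
| --- | --- |
| mov %eax, %ecx; mov %ecx, %eax | mov %eax, %ecx |
| sub $123, %eax; add $456, %eax | add $333, %eax |
| movl (%eax), %ecx; inc %ecx; movl %ecx, (%eax) | inc (%eax) |
| … | … |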
![Page 12: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/12.jpg)
Equivalence Test
Two sequences → Execution Test → (pass) → Boolean Test → (pass) → equivalent
If either test fails: not-equivalent
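The two-stage check can be sketched as follows. The point of the staging is that the cheap execution test rejects most candidates before the expensive boolean test runs. In this sketch the boolean test is replaced by exhaustive checking over a toy 8-bit state, which is only a stand-in for a real satisfiability-based proof, and all names are hypothetical.

```python
import random

MASK = 0xFF  # toy machine state: one 8-bit register value

def execution_test(f, g, trials=16, seed=0):
    """Fast filter: run both sequences on a few random inputs."""
    rng = random.Random(seed)
    inputs = (rng.randrange(MASK + 1) for _ in range(trials))
    return all(f(x) == g(x) for x in inputs)

def boolean_test(f, g):
    """Stand-in for the boolean test: exhaustive comparison over all 256 inputs."""
    return all(f(x) == g(x) for x in range(MASK + 1))

def equivalent(f, g):
    # The boolean test only runs if the execution test passes.
    return execution_test(f, g) and boolean_test(f, g)

target = lambda x: ((x - 123) + 456) & MASK   # sub $123 ; add $456
good = lambda x: (x + 333) & MASK             # add $333
bad = lambda x: (x + 332) & MASK              # off by one

print(equivalent(target, good), equivalent(target, bad))  # True False
```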
![Page 13: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/13.jpg)
Peephole Superoptimization
Step 3

Table of Peephole Optimizations

| pattern | replace-with |
| --- | --- |
| mov %eax, %ecx; mov %ecx, %eax | mov %eax, %ecx |
| sub $123, %eax; add $456, %eax | add $333, %eax |
| movl (%eax), %ecx; inc %ecx; movl %ecx, (%eax) | inc (%eax) |
| … | … |
![Page 14: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/14.jpg)
Talk Outline
• Superoptimization
• Peephole Superoptimizers
• Application to Binary Translation
• Implementation & Experimental Results
• Conclusion
![Page 15: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/15.jpg)
Application to Binary Translation
• Our approach: Use lots of peephole transformations

| pattern (ppc) | ppc→x86 register map | translate-to (x86) |
| --- | --- | --- |
| addi r1,r1,1 | r1→eax | inc %eax |
| mullw r1,r1,2 | r1→eax | shl %eax |
| add r1,r1,r2 | r1→eax; r2→ecx | add %ecx,%eax |
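A rule table like this can be driven by a small matcher that is parameterized by the register map. The sketch below encodes the three rules from the slide as regular expressions over textual PowerPC instructions. That encoding, and the `translate` helper, are assumptions for illustration rather than the translator's actual data structures.

```python
import re

# (regex over one PowerPC instruction, x86 template).
# Named groups capture PowerPC registers; the register map supplies the x86 names.
RULES = [
    (r"addi (?P<d>r\d+),(?P=d),1", "inc {d}"),
    (r"mullw (?P<d>r\d+),(?P=d),2", "shl {d}"),
    (r"add (?P<d>r\d+),(?P=d),(?P<s>r\d+)", "add {s},{d}"),
]

def translate(ppc_code, regmap):
    """Translate instruction by instruction; fail loudly if no rule matches."""
    x86 = []
    for ins in ppc_code:
        for pattern, template in RULES:
            m = re.fullmatch(pattern, ins)
            if m:
                operands = {k: regmap[v] for k, v in m.groupdict().items()}
                x86.append(template.format(**operands))
                break
        else:
            raise ValueError(f"no rule for {ins!r}")
    return x86

ppc = ["addi r1,r1,1", "mullw r1,r1,2", "add r1,r1,r2"]
print(translate(ppc, {"r1": "%eax", "r2": "%ecx"}))
# ['inc %eax', 'shl %eax', 'add %ecx,%eax']
```

Changing the register map changes the emitted operands without touching the rules.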
![Page 16: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/16.jpg)
Peephole Binary Translation
| source arch. (ppc) | register map | destination arch. (x86) |
| --- | --- | --- |
| mr r1, r2; mr r2, r1 | r1→eax; r2→ecx | mov %eax, %ecx |
| lis r1, 0x12; ori r1, r1, 0x3456 | r1→Mr1 | mov $0x123456, Mr1 |
| ldl r2, (r1); addi r2, r2, 1; stl r2, (r1) | r1→eax; r2→ecx | inc (%eax) |
| … | … | … |
![Page 17: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/17.jpg)
Register Map Selection
• The best code may require changing the register map from one code point to another
• The choice of register maps affects the choice of instruction selection and vice-versa
![Page 18: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/18.jpg)
Register Map Selection
Example
powerpc sequence:
P0: li r1, 123
P1: addi r2, r2, 1
P2: subf r2, r1, r2
P3: ori r1, r1, 31
exit
x86 sequence: ?
At entry: r1→Mr1 ; r2→Mr2
At exit: r1→Mr1 ; r2→Mr2

Cost Model
Instruction costs: if accesses memory, 10; else, 1
Switching costs: R→M or M→R: 10
![Page 19: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/19.jpg)
Register Map Selection
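Greedy Strategy
Instruction costs: if accesses memory, 10; else, 1
Switching costs: R→M or M→R: 10

| point | powerpc | register map | x86 | switching cost | instruction cost |
| --- | --- | --- | --- | --- | --- |
| entry | | r1→Mr1 ; r2→Mr2 | | | |
| P0 | li r1, 123 | r1→Mr1 | movl $123, Mr1 | 0 | 10 |
| P1 | addi r2,r2,1 | r2→Mr2 | incl Mr2 | 0 | 10 |
| P2 | subf r2,r1,r2 | r1→Mr1 ; r2→eax | subl Mr1, eax | 10 | 10 |
| P3 | ori r1,r1,31 | r1→Mr1 | orl $31, Mr1 | 0 | 10 |
| exit | | r1→Mr1 ; r2→Mr2 | | 10 | |
| Total | | | | 20 | 40 |

Grand Total 60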
![Page 20: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/20.jpg)
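Register Map Selection: Optimal Solution
Instruction costs: if accesses memory, 10; else, 1
Switching costs: R→M or M→R: 10

| point | powerpc | register map | x86 | switching cost | instruction cost |
| --- | --- | --- | --- | --- | --- |
| entry | | r1→Mr1 ; r2→Mr2 | | | |
| P0 | li r1, 123 | r1→eax | movl $123, eax | 10 | 1 |
| P1 | addi r2,r2,1 | r2→ecx | incl ecx | 10 | 1 |
| P2 | subf r2,r1,r2 | r1→eax ; r2→ecx | subl eax, ecx | 0 | 1 |
| P3 | ori r1,r1,31 | r1→eax | orl $31, eax | 0 | 1 |
| exit | | r1→Mr1 ; r2→Mr2 | | 20 | |
| Total | | | | 40 | 4 |

Grand Total 44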
![Page 21: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/21.jpg)
Register Map Selection
• Use Dynamic Programming
– near-optimal solution
– account for translations spanning multiple instructions
– simultaneously perform instruction-selection and register-mapping
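The four-instruction example from the preceding slides can be replayed with a small cost calculation that compares a greedy choice with dynamic programming over register maps. This is a toy model under stated assumptions: each PowerPC register is either in memory (M) or in an x86 register (R), the only candidate translations are the ones shown on the slides, and every function and variable name is invented for the sketch. It is not the translator's actual algorithm, which also handles rules spanning multiple instructions.

```python
from itertools import product

REGS = ("r1", "r2")
STATES = list(product("MR", repeat=len(REGS)))  # location of (r1, r2)
SWITCH_COST = 10                                # R->M or M->R, per register
ENTRY = EXIT = ("M", "M")

# Per instruction: (locations the rule requires, x86 code, instruction cost)
CANDIDATES = [
    [({"r1": "M"}, "movl $123, Mr1", 10), ({"r1": "R"}, "movl $123, eax", 1)],  # li r1, 123
    [({"r2": "M"}, "incl Mr2", 10), ({"r2": "R"}, "incl ecx", 1)],              # addi r2,r2,1
    [({"r1": "M", "r2": "R"}, "subl Mr1, eax", 10),
     ({"r1": "R", "r2": "R"}, "subl eax, ecx", 1)],                             # subf r2,r1,r2
    [({"r1": "M"}, "orl $31, Mr1", 10), ({"r1": "R"}, "orl $31, eax", 1)],      # ori r1,r1,31
]

def switch_cost(a, b):
    """Cost of moving from register map a to register map b."""
    return SWITCH_COST * sum(x != y for x, y in zip(a, b))

def instr_cost(i, state):
    """Cheapest candidate for instruction i that is usable under state, or None."""
    costs = [cost for need, _, cost in CANDIDATES[i]
             if all(state[REGS.index(reg)] == loc for reg, loc in need.items())]
    return min(costs, default=None)

def greedy():
    """Pick the locally cheapest register map at every program point."""
    state, total = ENTRY, 0
    for i in range(len(CANDIDATES)):
        step, state = min((switch_cost(state, s) + instr_cost(i, s), s)
                          for s in STATES if instr_cost(i, s) is not None)
        total += step
    return total + switch_cost(state, EXIT)

def dynamic_programming():
    """best[s] = cheapest cost of reaching the current point with register map s."""
    best = {ENTRY: 0}
    for i in range(len(CANDIDATES)):
        best = {s: min(c + switch_cost(p, s) for p, c in best.items()) + instr_cost(i, s)
                for s in STATES if instr_cost(i, s) is not None}
    return min(c + switch_cost(s, EXIT) for s, c in best.items())

print(greedy(), dynamic_programming())  # 60 44
```

The greedy pass never pays the up-front switching cost and ends at 60, while the dynamic program accepts early switches and reaches the grand total of 44 shown for the optimal solution.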
![Page 22: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/22.jpg)
Talk Outline
• Superoptimization
• Peephole Superoptimizers
• Application to Binary Translation
• Implementation & Experimental Results
• Conclusion
![Page 23: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/23.jpg)
PowerPC → x86 Translator Implementation
• Superoptimizer
– Use a PPC emulator (Qemu) for execution test
– Use a SAT solver (zChaff) for boolean test
• Static user-level translator
– ELF 32-bit ppc/Linux binary → ELF 32-bit x86/Linux binary
– Translate most (but not all) system calls
![Page 24: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/24.jpg)
Implementation
Endianness: ppc big-endian ; x86 little-endian
– Convert all memory writes to big-endian (source)
– Convert all memory reads to little-endian (dest)
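The effect of the two conversions can be illustrated with a byte swap around every 32-bit memory access, so that memory keeps the big-endian layout the PowerPC program expects while registers hold values in the host's native form. The helper names below are hypothetical, and the sketch only covers aligned 32-bit accesses.

```python
def bswap32(value):
    """Reverse the byte order of a 32-bit value (what the x86 bswap instruction does)."""
    return int.from_bytes(value.to_bytes(4, "little"), "big")

def x86_store32(memory, addr, value):
    """Plain little-endian store, as x86 hardware performs it."""
    memory[addr:addr + 4] = value.to_bytes(4, "little")

def x86_load32(memory, addr):
    """Plain little-endian load, as x86 hardware performs it."""
    return int.from_bytes(memory[addr:addr + 4], "little")

def translated_store32(memory, addr, value):
    """Translated PowerPC store: swap first, so memory ends up big-endian."""
    x86_store32(memory, addr, bswap32(value))

def translated_load32(memory, addr):
    """Translated PowerPC load: swap after reading the big-endian word."""
    return bswap32(x86_load32(memory, addr))

memory = bytearray(4)
translated_store32(memory, 0, 0x12345678)
print(memory.hex(), hex(translated_load32(memory, 0)))  # 12345678 0x12345678
```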
Compiler Optimizations
– Problem: PowerPC optimizer staggers data-dependent instructions to reduce pipeline stalls
– Solution: Cluster data-dependent instructions in basic block before translation
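One way to cluster data-dependent instructions is a dependence-preserving reordering that, whenever it has a choice, emits an instruction depending on the most recently emitted ones. The sketch below is an assumed heuristic, not the translator's actual pass: it tracks register dependences only (memory is not modeled), and the tuple format `(text, defs, uses)` is invented for the example.

```python
def cluster(block):
    """block: list of (text, defs, uses) in program order. Returns reordered texts."""
    n = len(block)
    deps = [set() for _ in range(n)]  # deps[j] = earlier instructions that j must follow
    for j in range(n):
        _, defs_j, uses_j = block[j]
        for i in range(j):
            _, defs_i, uses_i = block[i]
            # read-after-write, write-after-read, or write-after-write
            if defs_i & uses_j or uses_i & defs_j or defs_i & defs_j:
                deps[j].add(i)

    order = []
    while len(order) < n:
        # An instruction is ready once everything it depends on has been emitted.
        ready = [j for j in range(n) if j not in order and deps[j] <= set(order)]
        # Prefer a ready instruction that depends on the most recently emitted ones;
        # otherwise fall back to the earliest ready instruction.
        pick = next((j for last in reversed(order) for j in ready if last in deps[j]),
                    ready[0])
        order.append(pick)
    return [block[j][0] for j in order]

staggered = [
    ("li r3,1",      {"r3"}, set()),
    ("li r4,2",      {"r4"}, set()),
    ("addi r3,r3,5", {"r3"}, {"r3"}),
    ("addi r4,r4,7", {"r4"}, {"r4"}),
    ("add r5,r3,r4", {"r5"}, {"r3", "r4"}),
]
print(cluster(staggered))
# ['li r3,1', 'addi r3,r3,5', 'li r4,2', 'addi r4,r4,7', 'add r5,r3,r4']
```

After clustering, each `li`/`addi` pair is adjacent, which gives multi-instruction peephole rules a chance to match.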
• Many Issues
– Condition Codes, Endianness, System Calls, Stack and Heap, Indirect Jumps, Function Calls and Returns, Register Name Constraints, Untranslated Opcodes, Compiler Optimizations
![Page 25: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/25.jpg)
Experimental Results
• Setup
– Pentium4 3.0 GHz, 1MB Cache, 4GB Memory
– gcc 4.0.1, glibc 2.3.6
– Use soft-float library
– Statically-linked input executables
• Benchmarks
– Microbenchmarks, SPEC CINT2000
• Metrics
– Compare against natively-compiled code
– Compare against other binary translators
• Qemu, Apple's Rosetta
![Page 26: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/26.jpg)
Experimental Setup
• For our experiments
– there are around 750 translation rules in the peephole table
– the translation table is computed offline and it can take up to a week to compute the peephole rules
![Page 27: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/27.jpg)
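Experimental Results: Setup
[Figure: the C source is compiled twice, with gcc <options> -arch=ppc into a PowerPC executable and with gcc <options> -arch=x86 into an x86 executable. Peephole Binary Translation turns the PowerPC executable into a second x86 executable, and the two x86 executables are compared.]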
![Page 28: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/28.jpg)
Microbenchmarks
| benchmark | description |
| --- | --- |
| emptyloop | A bounded for-loop doing nothing |
| fibo | Compute first few Fibonacci numbers |
| quicksort | Quicksort on 64-bit integers |
| mergesort | Mergesort on 64-bit integers |
| bubblesort | Bubblesort on 64-bit integers |
| hanoi1 | Towers of Hanoi Algorithm 1 |
| hanoi2 | Towers of Hanoi Algorithm 2 |
| hanoi3 | Towers of Hanoi Algorithm 3 |
| traverse | Traverse a linked list |
| binsearch | Binary search on a sorted array |
![Page 29: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/29.jpg)
Microbenchmarks
[Figure: bar chart of performance as a percentage of native (%), axis 0 to 100, for emptyloop, fibo, quicksort, mergesort, bubsort, hanoi1, hanoi2, hanoi3, traverse and binsearch, with one bar each for O0, O2 and O2 -omit-frame-pointer. avg: 90% of native]
![Page 30: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/30.jpg)
Experimental Results: Microbenchmarks
• We sometimes outperform native performance on these small benchmarks!
– gcc generates better code for powerpc primarily because it has the luxury of many registers
– Our register-mapping algorithm performs an efficient "re-allocation" of the PowerPC registers to x86 registers.
![Page 31: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/31.jpg)
Experimental Results: SPEC CINT2000
[Figure: bar chart of performance as a percentage of native (%), axis 0 to 100, for bzip2, gap, gzip, mcf, parser, twolf and vortex, with one bar each for O0 and O2.]
![Page 32: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/32.jpg)
Comparisons with Qemu and Rosetta
• Qemu
– Use same PowerPC and x86 executables as used for our own translator
• Rosetta
– Runs on Mac OS X and hence supports only Mac executables
– Recompiled the benchmarks on Mac using the same compiler version (gcc 4.0.1)
– Mac Hardware: Intel Core 2 Duo 1.83GHz processor, 32KB L1-cache, 2MB L2-cache and 2GB memory
![Page 33: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/33.jpg)
Comparisons with Qemu and Rosetta
[Figure: two bar charts comparing qemu, rosetta and peep, axis 0 to 100. Panel -O0: bzip2, gap, gzip, mcf, parser, twolf, vortex. Panel -O2: bzip2, gap, gzip, mcf, parser, twolf. Slide annotations, in panel order: avg: 3% faster than rosetta; avg: 12% faster than rosetta]
![Page 34: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/34.jpg)
Translation Time
• Takes 2-6 minutes to translate a 650KB executable (around 100K instructions)
– majority of time spent in optimal register map computation
• It is possible to reduce this to <10 seconds
– For 98K instructions (<0.01% of time), use any register map. Fast (<1 second)
– For other 2K, use optimal computation
![Page 35: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/35.jpg)
Conclusions and Future Work
• A scheme to perform efficient binary translation using a superoptimizer
– Competitive performance
– Simplified Design
• Other applications
– Just-in-time compilation
– Machine virtualization
![Page 36: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/36.jpg)
Q&A Thank you.
![Page 37: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/37.jpg)
Backup Slides
![Page 38: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/38.jpg)
Experimental Results: Variants of peep
• No-reorder (aka do not "de-optimize")
– Do not reorder instructions inside basic block (recall compiler optimizations)
• With-profile
– Profile executables in a separate offline phase. Use this data to determine predecessor weights during register mapping
![Page 39: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/39.jpg)
Comparisons with Variants of peep
[Figure: two bar charts comparing the no-reorder, default and with-profile variants of peep, axis 0 to 100. Panel -O0 (bzip2, gap, gzip, mcf, parser, twolf, vortex): No-reorder: 0% slower; With-profile: 0.56% faster. Panel -O2 (bzip2, gap, gzip, mcf, parser, twolf): No-reorder: 6.9% slower; With-profile: 1.4% faster]
![Page 40: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/40.jpg)
Hardware: Intel Pentium
Software: PowerPC
We need a binary translator
![Page 41: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/41.jpg)
Our Approach
Use only peephole rules to translate
E.g.,
RISC: ld [r2]; addi 1; st [r2]
register map: r2→er3
CISC: inc [er3]
![Page 42: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/42.jpg)
Superoptimization
• Code generator that attempts to generate the optimal code to compute a given function
– Use brute force search
E.g.
int signum(int x) {
    if (x > 0) return 1;
    if (x < 0) return -1;
    else return 0;
}
On Motorola 68020:
add.l d0, d0
subx.l d1, d1
negx.l d0
addx.l d1, d1
![Page 43: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/43.jpg)
Superoptimization: Previous Work
• H. Massalin, 1987
– Motorola 68020
– Reached sequence lengths of 12 (very specialized)
• GNU Superoptimizer (GSO), 1992
– Portable and efficient
– Used for eliminating branches
– Reached sequence lengths of 4
• Denali, HP Labs, 2002
– Goal-Directed Superoptimization using Theorem Provers
![Page 44: Binary Translation Using Peephole Superoptimizers](https://reader035.fdocuments.us/reader035/viewer/2022062309/568140a6550346895dac6177/html5/thumbnails/44.jpg)
Binary Translation
Use pattern-matching rules to translate code from one architecture to another
Simple
Efficient
Portable