Building Binary Optimizer - LLVM
llvm.org/devmtg/2016-03/Presentations/BOLT_EuroLLVM_2016.pdf · 2019-10-30
BOLT: Binary Optimization and Layout Tool
• Built in less than 6 months
• x64 Linux ELF
• Runs on large binary (HHVM, non-jitted part)
• Improves I-Cache, ITLB, branch misses
• Deployed to limited production
Overview
• Why a binary optimizer
• Is LLVM the best choice?
• Challenges
• Approaches to implementation
• Results
• Future plans
Why Binary Optimizer
• No need to link sample-based profile data to source code or IR
• Can optimize 3rd-party libraries without source code
• Has “whole-program” view
• Some optimizations can only be done on a binary
Existing Binary Optimizers and Binary Rewriters
• HP ISpike
• Microsoft Vulcan/BBT
• Sun/Oracle Studio Binary Optimizer
• Intel PIN
• Dynamic binary optimizers
• Many more
Usage Model: Example with HHVM binary running in production
• perf record -b -e .... -a -- sleep 300
• perf2bolt perf.data -o perf.fdata -b hhvm
• llvm-bolt -data=perf.fdata hhvm -o hhvm.bolt
Why LLVM
• Disassembler
• Assembler
• ... sharing the same representation
• ELFs, DWARFs, and ORCs
Implementation Overview
• Code discovery
• Disassembly
• CFG construction
• Optimizations
• Available storage discovery
• Code (and data) emission
Discovery Process: Functions and Objects
• Symbol table
  • need unstripped binary
• .eh_frame
  • unwind info includes function boundaries
• No general problem solution
• Don’t need to know everything to optimize
• Relocations from the linker
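
As a rough illustration of symbol-table-driven discovery, the sketch below derives function address ranges from a list of symbol entries. The `Symbol` tuple and the `function_ranges` helper are hypothetical names made up for this example, not BOLT data structures. A real tool would read `STT_FUNC` entries from `.symtab` and could cross-check them against `.eh_frame`. Dropping overlapping symbols is one conservative policy assumed here, in line with "don't need to know everything to optimize".

```python
from typing import NamedTuple


class Symbol(NamedTuple):
    name: str
    address: int
    size: int
    kind: str  # "FUNC" or "OBJECT", mirroring ELF STT_FUNC / STT_OBJECT


def function_ranges(symbols):
    """Return sorted (start, end, name) ranges for function symbols.

    Zero-sized symbols are skipped, and a function that overlaps the
    previously accepted one is dropped (assumed policy for this sketch).
    """
    funcs = sorted(
        (s for s in symbols if s.kind == "FUNC" and s.size > 0),
        key=lambda s: s.address,
    )
    ranges = []
    for s in funcs:
        start, end = s.address, s.address + s.size
        if ranges and start < ranges[-1][1]:
            continue  # overlapping symbol (for example an alias); skip it
        ranges.append((start, end, s.name))
    return ranges


symbols = [
    Symbol("main", 0x401000, 0x40, "FUNC"),
    Symbol("helper", 0x401040, 0x20, "FUNC"),
    Symbol("table", 0x602000, 0x100, "OBJECT"),
    Symbol("main_alias", 0x401000, 0x40, "FUNC"),
]
for start, end, name in function_ranges(symbols):
    print(f"{name}: [{start:#x}, {end:#x})")
```
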
Disassembly
• Relocation reconstruction for code
• %rip-relative addressing on x64
• Relocations for %rip operands
• tblgen fixes required for some instructions
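
To make the %rip-relative point concrete: on x86-64, a %rip-relative operand is resolved against the address of the next instruction, so moving an instruction changes the displacement it needs. The sketch below shows that arithmetic only. The helper names and the addresses are illustrative assumptions, not taken from BOLT.

```python
import struct


def rip_target(insn_addr, insn_len, disp32):
    """Absolute address referenced by a %rip-relative operand.

    %rip holds the address of the *next* instruction, so the target is
    insn_addr + insn_len + disp32.
    """
    return insn_addr + insn_len + disp32


def relocated_disp32(new_insn_addr, insn_len, target):
    """Displacement that lets the moved instruction reach the same target."""
    disp = target - (new_insn_addr + insn_len)
    if not -(2**31) <= disp < 2**31:
        raise ValueError("target out of range for a 32-bit displacement")
    return disp


# lea 0x200b(%rip), %rax is 7 bytes: 48 8d 05 <disp32, little-endian>
code = bytes.fromhex("488d050b200000")
disp = struct.unpack_from("<i", code, 3)[0]

old_addr, new_addr = 0x400500, 0x500000  # hypothetical addresses
target = rip_target(old_addr, len(code), disp)
new_disp = relocated_disp32(new_addr, len(code), target)
patched = code[:3] + struct.pack("<i", new_disp)

print(hex(target), new_disp, patched.hex())
```

Recording the absolute target as a symbolic reference, rather than keeping the raw displacement, is what lets the instruction be re-emitted at any new address.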
CFG Construction
• x86 binary -> MCInst with CFG -> ORC -> x86 binary
• MCInst vs MachineInstruction
• No higher than MachineInstruction
• Conservative approach that works
• Modify code that we 100% understand
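
A minimal sketch of CFG construction over already-disassembled instructions, using the textbook "leaders" method. The instruction tuples and the `build_cfg` name are assumptions for illustration, and this is not BOLT's implementation. Raising an error on a branch that leaves the function is one way to express "modify code that we 100% understand"; a real tool would need to handle such branches (for example tail calls) differently.

```python
def build_cfg(insns):
    """Split a function into basic blocks and compute successor edges.

    insns: list of (addr, kind, target) sorted by addr, where kind is
    "jmp", "jcc", "ret", or "other" and target is a branch destination
    or None. Returns {block_start: (instruction_addrs, successor_starts)}.
    """
    if not insns:
        return {}
    addrs = [a for a, _, _ in insns]
    known = set(addrs)

    # A leader starts a block: the entry, every branch target, and every
    # instruction that follows a branch or return.
    leaders = {addrs[0]}
    for i, (addr, kind, target) in enumerate(insns):
        if kind in ("jmp", "jcc"):
            if target not in known:
                raise ValueError(f"branch at {addr:#x} leaves the function")
            leaders.add(target)
        if kind in ("jmp", "jcc", "ret") and i + 1 < len(insns):
            leaders.add(addrs[i + 1])

    blocks = {}
    current = None
    for i, (addr, kind, target) in enumerate(insns):
        if addr in leaders:
            current = addr
            blocks[current] = ([], set())
        blocks[current][0].append(addr)
        is_last = i + 1 == len(insns) or addrs[i + 1] in leaders
        if not is_last:
            continue
        succs = blocks[current][1]
        if kind in ("jmp", "jcc"):
            succs.add(target)            # taken edge
        if kind in ("jcc", "other") and i + 1 < len(insns):
            succs.add(addrs[i + 1])      # fall-through edge
    return {start: (body, sorted(s)) for start, (body, s) in blocks.items()}


insns = [
    (0x00, "other", None),
    (0x04, "jcc", 0x10),
    (0x06, "other", None),
    (0x0A, "jmp", 0x14),
    (0x10, "other", None),
    (0x14, "ret", None),
]
for start, (body, succs) in build_cfg(insns).items():
    print(hex(start), [hex(a) for a in body], [hex(s) for s in succs])
```
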
Optimizations
• Feedback-directed basic block reordering (modified Pettis-Hansen)
• Sample-based profiling with LBR
• Can gather profile on a binary running in production
• On top of the linker script that does function placement
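
The slide mentions a modified Pettis-Hansen algorithm without giving details. The sketch below is a plain greedy chain-merging layout in that general spirit, not the modified version used in BOLT. The function name, block names, and edge counts are made up for illustration. The idea is to process edges from hottest to coldest and glue two chains together whenever the edge runs from the tail of one chain to the head of another, so that hot edges become fall-throughs.

```python
def layout_blocks(entry, blocks, edge_counts):
    """Greedy bottom-up chain merging.

    blocks: iterable of block names; edge_counts: {(src, dst): count}.
    Returns a block order with the entry block first.
    """
    chain_of = {b: [b] for b in blocks}  # block -> the chain (list) holding it

    # Hottest edges first; ties are broken by name to stay deterministic.
    for (src, dst), _count in sorted(
        edge_counts.items(), key=lambda kv: (-kv[1], kv[0])
    ):
        a, b = chain_of[src], chain_of[dst]
        if a is b or dst == entry:
            continue  # already together, or would displace the entry block
        if a[-1] != src or b[0] != dst:
            continue  # src must end its chain and dst must start its chain
        a.extend(b)
        for blk in b:
            chain_of[blk] = a

    # Heat of a block = total profile count flowing into it.
    heat = {b: 0 for b in chain_of}
    for (_src, dst), count in edge_counts.items():
        heat[dst] += count

    seen, chains = set(), []
    for chain in chain_of.values():
        if id(chain) not in seen:
            seen.add(id(chain))
            chains.append(chain)

    first = chain_of[entry]
    rest = sorted(
        (c for c in chains if c is not first),
        key=lambda c: -sum(heat[b] for b in c),
    )
    return [b for c in [first] + rest for b in c]


blocks = ["entry", "cold", "hot", "exit"]  # original (source) order
edge_counts = {
    ("entry", "cold"): 10,
    ("entry", "hot"): 990,
    ("hot", "exit"): 990,
    ("cold", "exit"): 10,
}
print(layout_blocks("entry", blocks, edge_counts))
```

In this example the rarely executed `cold` block ends up after the hot path, which is the effect that helps I-cache and branch behavior.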
Allocating New Code and Data (ELF-specific)
• Pretend we are linking for jitting
• Map address spaces for relocation processing
• No prior allocation required
• Tricky to relocate ELF program header table
• Fix section header table
Ready to run?
C++ Exceptions: IA64 “zero-cost”
• .eh_frame updated with new CFIs
• Heavy usage of RememberState/RestoreState
• .eh_frame_hdr section and GNU_EH_FRAME program header
• .gcc_except_table with new call site table
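
RememberState/RestoreState presumably refer to the DWARF call frame instructions `DW_CFA_remember_state` and `DW_CFA_restore_state`, which push and pop the current unwind rules. The toy interpreter below tracks only the CFA offset to show why they are convenient once blocks are reordered: a block that tears down the frame can be followed by one that still needs it. The op names are shortened, the instruction stream is invented, and saving the CFA rule as part of the remembered state is an assumption that matches common unwinders.

```python
def cfa_offsets(cfi_ops):
    """Interpret a tiny subset of call frame instructions.

    cfi_ops: list of (code_offset, op, arg).
    Returns [(code_offset, cfa_offset)] after each instruction is applied.
    """
    state = {"cfa_offset": 8}  # x86-64 at function entry: CFA = %rsp + 8
    stack = []
    trace = []
    for code_offset, op, arg in cfi_ops:
        if op == "def_cfa_offset":
            state["cfa_offset"] = arg
        elif op == "remember_state":
            stack.append(dict(state))  # save a copy of the current rules
        elif op == "restore_state":
            state = stack.pop()        # return to the saved rules
        else:
            raise ValueError(f"unsupported CFI op: {op}")
        trace.append((code_offset, state["cfa_offset"]))
    return trace


ops = [
    (1, "def_cfa_offset", 16),     # after push %rbp
    (10, "remember_state", None),  # an early-return block starts here
    (11, "def_cfa_offset", 8),     # after pop %rbp
    (12, "restore_state", None),   # the next block still has the frame
]
print(cfa_offsets(ops))
```
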
Benchmark: HHVM
• No SpecCPU2006
• PHP JIT
• github.com/facebook/hhvm
• More components linked-in at FB
• >100MB .text
• ~4GB with debug info
Benchmark: HHVM
• Hot paths marked with __builtin_expect()
• Hottest small functions written in assembly
• Carefully tuned inlining
• Linker script for function placement
• Huge pages for code
• <90% functions optimized by BOLT
• Execution time split between binary and jitted code
[Chart: HHVM; axis from -1.00% to 8.00% in 1.00% steps; plotted values not recoverable]
Updating Debug Information: DWARF
• WIP
• .debug_info mostly unchanged
• DW_AT_ranges replaces contiguous attributes
• .debug_line rewritten and DW_AT_stmt_list updated
• .debug_ranges, .debug_aranges modified
• .debug_loc modified
• More work with more optimizations
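
Once a function's blocks are no longer contiguous, a single low/high address pair cannot describe it, which is why a range list is needed. The sketch below shows the basic bookkeeping under simple assumptions: merge the block ranges that touch, then pick the attribute form. The helper names and addresses are hypothetical, and the dictionary is a stand-in for real DWARF encoding (where `DW_AT_high_pc` may also be stored as an offset from `DW_AT_low_pc`).

```python
def coalesce_ranges(block_ranges):
    """Merge half-open [low, high) address ranges that touch or overlap."""
    merged = []
    for low, high in sorted(block_ranges):
        if merged and low <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], high))
        else:
            merged.append((low, high))
    return merged


def describe(block_ranges):
    """Choose contiguous attributes or a range list for a function."""
    ranges = coalesce_ranges(block_ranges)
    if len(ranges) == 1:
        return {"DW_AT_low_pc": ranges[0][0], "DW_AT_high_pc": ranges[0][1]}
    return {"DW_AT_ranges": ranges}


# Two adjacent hot blocks plus one block moved far away to a cold region.
split_function = [(0x1000, 0x1010), (0x9000, 0x9020), (0x1010, 0x1030)]
print(describe(split_function))
```
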
Limitations
• Well-formed C/C++
• Properly marked assembly functions
• Self-modifying code
• Self-validating code
• Not implemented
  • Multiple-entry functions
  • Switch tables
Future Optimizations
• Inlining
• De-virtualization
• Conditional tail-call
• ABI-breaking optimizations
• Remove unnecessary spills/reloads after analyzing call chain
• Data reordering
Future Plans
• Linker-style optimizations
• ICF
• Unreachable/dead-code (gc-sections)
• Function re-ordering
• 100% coverage
• Replace linker script and optimizations
• Move entry points
• Integrate into dynamic engine
Compared to AutoFDO/LTO
• No direct comparison
• Mixed results from AutoFDO when it works
• BOLT is faster than running linker with linker script
• The goal is to complement compiler and extract every single bit of performance out of a binary
Example
void foo(int c) {
  if (c > 0) {
    A; // macro A
  } else {
    B; // macro B
  }
}

void bar() {
  ...
  foo(/* > 0*/);
  ...
}

void baz() {
  ...
  foo(/* <= 0*/);
  ...
}
Example (same foo, bar, and baz code as the previous slide, annotated with counts: 1000, 1000)
Example (same foo, bar, and baz code as the previous slide, annotated with counts: 1000, 1000, 1000, 1000)
Example (annotated on the slide with counts: 1000, 1000, 1000, 1000)
void foo(int c) {
  if (c > 0) {
    A; // macro A
  } else {
    B; // macro B
  }
}

void bar() {
  ...
  ..
  A; // macro A
  ..
  B; // macro B
  ..
  ...
}

void baz() {
  ...
  ..
  A; // macro A
  ..
  B; // macro B
  ..
  ...
}
Example (annotated on the slide with counts: 1000, 1000, 1000, 1000, 1000, 1000)
void foo(int c) {
  if (c > 0) {
    A; // macro A
  } else {
    B; // macro B
  }
}

void bar() {
  ...
  A; // macro A
  ...
}
bar.cold {
  ..
  B; // macro B
  ..
}

void baz() {
  ...
  B; // macro B
  ...
}
baz.cold {
  ..
  A; // macro A
  ..
}
Thank You!
• LLVM community
• Rafael Auler - Facebook intern
• Gabriel Poesia - Facebook intern