Post on 31-Dec-2020
Jan Hubička and Martin LiškaSUSElabs
jh@suse.cz, mliska@suse.cz
Building openSUSE with link-time optimizations
Outlilne
● What is link-time optimization? ● Link-time optimization and GCC● Benchmarks● Can we build openSUSE with link-time optimization by
default?
What is link-time optimization?
Per-file compilation
File1.c
File2.c
File3.c
File4.c
File1.o
File2.o
File3.o
File4.o
GCC
GCC
GCC
GCC
ld / gold a.out
Link-time optimization
File1.c
File2.c
File3.c
File4.c
File1.oIL
File2.oIL
File3.oIL
File4.oIL
GCC
GCC
GCC
GCC
ld / gold a.out
LTO plug-inlink-timecompiler
Benefits of LTO● Symbol promotion
(from linker’s resolution data most symbols become “static”)● Cross-module inlining, constant propagation● Aggressive unreachable code removal● Profile propagation● EH optimization (propagation of “nothrow”)● Identical code folding● Optimized code layout● and more :)
Problems of LTO
● Whole toolchain has to be restructured● Slow compile-edit cycle● Harder bugreports
(often whole program needed to reproduce issue)● Not 100% transparent to user, but in most cases all one needs
to do is to add -flto
Link-time optimization and GCC
Modernizing GCC (LTO perspective)● 1999 (GCC 2.95): Function-at-a-time● 2001 (GCC 3.0): New inliner (first high-level opt . in gcc)● 2004 (GCC 3.4): Unit-at-a-time; intermodule compilation for C; Inter-procedural
optimization framework● 2005 (GCC 4.0): New SSA optimization framework● 2006 (GCC 4.1): Inter-procedural optimizations: profile guided inlining, pure/const
discovery, mod/ref, inter-procedural constant propagation, --fwhole-program –combine
● 2008 (GCC 4.4): Inter-procedural optimization on SSA; early optimization and inlining.● 2010 (GCC 4.5): Basic LTO framework (5 years in development)● 2011 (GCC 4.6): WHOPR (parallel link-time optimization); Firefox builds
Link-time optimization
File1.c
File2.c
File3.c
File4.c
File1.oIL
File2.oIL
File3.oIL
File4.oIL
GCC
GCC
GCC
GCC
ld / gold a.out
LTO plug-inlink-timecompiler
Parallelized Link-time optimization (WHOPR)
File1.c
File2.c
File3.c
File4.c
File1.oIL
File2.oIL
File3.oIL
File4.oIL
GCC
GCC
GCC
GCC
ld / gold a.out
LTO plguin
Whole ProgramAnalysis
Local opt.
Local opt
Local opt.
Modernizing GCC (LTO perspective)● 2012 (GCC 4.7.0): Memory use optimizations, new inliner heuristics, new inter-procedural
constant propagation with clonning● 2013 (GCC 4.8.0): symbol table; propagation of values passed through aggregates● 2014 (GCC 4.9.0): slim LTO objects by default; on demand loading of functions; devirtualization
pass; feedback directed code layout● 2015 (GCC 5): Identical Code Folding; COMDAT optimization; One Definition Rule for C++;
alignment propagation; correct command line options handling with LTO● 2016 (GCC 6): Linker-plugin now detects type of output binary. C&Fortran type merging. Better
alias anaysis● 2017 (GCC 7): Inter-procedural value range propagaion; bitwise propagation● 2018 (GCC 8): Early debug info. Profile representation rewrite. Function splitting now by default.
Reworked runtime estimation; Malloc attribute propagation
GCC optimization pipeline
Parser
IL generation
Early opts:
Early InlinerConstant prop.Forwward prop.Jump threading
Scalar repl. of aggr.Alias analysis
Redundancy ellim.Dead store ellim.Dead code ellim.
Tail recursionSwitch conversionpure/const/nothrow
EH optimizationProfile guessing
Compile time Link-time serial
IP analysisstreaming out
Symbol & typestreaming in+ merging
Inter-procedural(whole program)
Opts:
Dead symbol ellim.Symbol promotion
profile analysisIdentical code folding
devirtualizationConstant propagationconst/destr merging
Inliningpure/const/nothrow
mod/refcomdat
Partitioningstreaming out
Streaming in symbols, types and declarations
& link
Stream inand apply
transformations
High level opts:
Constant prop.Complette unrollForward prop.Alias analysis
Return slot opt.Redudancy ellim.Jump threading
Dead code ellim.Conditional store ellim.
Copy prop.If combine
Tail recursionCopy loop headersScalar repl. of aggr.
Dead store ellim.Dead code ellim.
ReassociationSincos, bswap opt.
Loop invariant motionPartial redundancy ellim.
Loop splittingUnroll and jam
Loop dsitributionLoop interchange ...
Low level opts:
Common subexpression ellim.Forward propagation
Copy propagationPartial Redundancy Ellim.
Code hoistingCopy propagation
Store motionIf conversion
Loop invariant motionLoop unrolling
Doloop optimizationWeb constructionCopy propagation
Common subexpression ellim.Dead store ellim.
Instruction combineFunction partitioningInstruction splitting
Live range shrinkingScheduling
Register allocationGlobal common subexpr. Ellim.
Shrink wrappingStack adjustment opt.
Register renamingConstant prop.
Code reorderingScheduling
X87 register stackCode/data alignment
Machine dependent reorg.Code output
Link-time parallel
Benchmarks
SPECint 2006 performance (relative to GCC 6)
GCC 7 -O
2
GCC 8 -O
2
0
0.5
1
1.5
2
2.5
generic
native
SPECint 2006 performance (relative to GCC 6)
GCC 7 -O
2
GCC 8 -O
2
GCC 6 -O
fast
GCC 7 -O
fast
GCC 8 -O
fast
0
2
4
6
8
10
12
14
generic
native
SPECint 2006 performance (relative to GCC 6)
GCC 7 -O
2
GCC 8 -O
2
GCC 6 -O
fast
GCC 7 -O
fast
GCC 8 -O
fast
0
2
4
6
8
10
12
14
generic
native
429.
mcf
458
.sjen
g
445
.gob
mk
403
.gcc
462
.libqu
antu
m
473
.ast
ar
464
.h26
4ref
471
.om
netp
p
401
.bzip
2
483
.xala
ncbm
k
400
.per
lbenc
h
456
.hm
mer
Geo
mea
n
-20
0
20
40
60
80
100
120
GCC -Ofast relative to GCC 6 -O2
GCC 6
GCC 7
GCC 8
SPECint 2006 performance (relative to GCC 6)
GCC 7 -O
2
GCC 8 -O
2
GCC 6 -O
fast
GCC 7 -O
fast
GCC 8 -O
fast
0
2
4
6
8
10
12
14
generic
native
SPECint 2006 performance (relative to GCC 6)
GCC 7 -O
2
GCC 8 -O
2
GCC 6 -O
fast
GCC 7 -O
fast
GCC 8 -O
fast
clang
/flan
g 6
-Ofa
st
ICC 1
8 -O
fast
0
2
4
6
8
10
12
14
generic
native
SPECint 2006 performance (relative to GCC 6)
GCC 7 -O
2
GCC 8 -O
2
GCC 6 -O
fast
GCC 7 -O
fast
GCC 8 -O
fast
clang
/flan
g 6
-Ofa
st
ICC 1
8 -O
fast
0
2
4
6
8
10
12
14
generic
native
400.
perlb
ench
401
.bzip
2
403
.gcc
429
.mcf
445
.gob
mk
456
.hm
mer
458
.sjen
g
462
.libqu
antu
m
464
.h26
4ref
471
.om
netp
p
473
.ast
ar
483
.xala
ncbm
k
Geo
mea
n
-50
-40
-30
-20
-10
0
10
Clang & ICC -Ofast relative to GCC 8
clang/flang 6
ICC 18
SPECint 2006 performance (relative to GCC 6)
GCC 7 -O
2
GCC 8 -O
2
GCC 6 -O
fast
GCC 7 -O
fast
GCC 8 -O
fast
clang
/flan
g 6
-Ofa
st
ICC 1
8 -O
fast
0
2
4
6
8
10
12
14
generic
native
SPECint 2006 performance (relative to GCC 6)
GCC 7 -O
2
GCC 8 -O
2
GCC 6 -O
fast
GCC 7 -O
fast
GCC 8 -O
fast
clang
/flan
g 6
-Ofa
st
ICC 1
8 -O
fast
GCC 8 -O
2 -fl
to
GCC 8 -O
fast
lto
0
2
4
6
8
10
12
14
generic
native
SPECint 2006 performance (relative to GCC 6)
GCC 7 -O
2
GCC 8 -O
2
GCC 6 -O
fast
GCC 7 -O
fast
GCC 8 -O
fast
clang
/flan
g 6
-Ofa
st
ICC 1
8 -O
fast
GCC 8 -O
2 -fl
to
GCC 8 -O
fast
lto
0
2
4
6
8
10
12
14
generic
native
400.
perlb
ench
401
.bzip
2
403
.gcc
429
.mcf
445
.gob
mk
456
.hm
mer
458
.sjen
g
462
.libqu
antu
m
464
.h26
4ref
471
.om
netp
p
473
.ast
ar
483
.xala
ncbm
k
Geo
mea
n
-4
-2
0
2
4
6
8
GCC -Ofast -flto relative to -Ofast
SPECint 2006 performance (relative to GCC 6)
GCC 7 -O
2
GCC 8 -O
2
GCC 6 -O
fast
GCC 7 -O
fast
GCC 8 -O
fast
clang
/flan
g 6
-Ofa
st
ICC 1
8 -O
fast
GCC 8 -O
2 -fl
to
GCC 8 -O
fast
lto
0
2
4
6
8
10
12
14
generic
native
SPECint 2006 performance (relative to GCC 6)
GCC 7 -O
2
GCC 8 -O
2
GCC 6 -O
fast
GCC 7 -O
fast
GCC 8 -O
fast
clang
/flan
g 6
-Ofa
st
ICC 1
8 -O
fast
GCC 8 -O
2 -fl
to
GCC 8 -O
fast
lto
clang
/fllan
g 6
-Ofa
st -f
lto
ICC 1
8 -O
fast
-flto
0
5
10
15
20
25
generic
native
SPECint 2006 performance (relative to GCC 6)
GCC 7 -O
2
GCC 8 -O
2
GCC 6 -O
fast
GCC 7 -O
fast
GCC 8 -O
fast
clang
/flan
g 6
-Ofa
st
ICC 1
8 -O
fast
GCC 8 -O
2 -fl
to
GCC 8 -O
fast
lto
clang
/fllan
g 6
-Ofa
st -f
lto
ICC 1
8 -O
fast
-flto
0
5
10
15
20
25
generic
native
400.
perlb
ench
401
.bzip
2
403
.gcc
429
.mcf
445
.gob
mk
456
.hm
mer
458
.sjen
g
462
.libqu
antu
m
464
.h26
4ref
471
.om
netp
p
473
.ast
ar
483
.xala
ncbm
k
Geo
mea
n
-40
-20
0
20
40
60
80
100
120
Clang/flang 6 and ICC 18 relative to GCC 8 (-O2 -flto)
clang/flang 6
ICC 18
SPECint 2006 performance (relative to GCC 6)
GCC 7 -O
2
GCC 8 -O
2
GCC 6 -O
fast
GCC 7 -O
fast
GCC 8 -O
fast
clang
/flan
g 6
-Ofa
st
ICC 1
8 -O
fast
GCC 8 -O
2 -fl
to
GCC 8 -O
fast
lto
clang
/fllan
g 6
-Ofa
st -f
lto
ICC 1
8 -O
fast
-flto
0
5
10
15
20
25
generic
native
Profile feedback & LTO works well together
Hmmer benchmark has problems; was excluded.
To build with profile feedback often you can use: ./configure ; CFLAGS=”-O2 -fprofile-generate” make ; make check ; make clean ; CFLAGS=”-O2 -fprofile-use” make
400.
perlb
ench
401
.bzip
2
403
.gcc
429
.mcf
445
.gob
mk
458
.sjen
g
462
.libqu
antu
m
464
.h26
4ref
471
.om
netp
p
473
.ast
ar
483
.xala
ncbm
k
Geo
mea
n
-10
-5
0
5
10
15
20
25
Performance relative to GCC 8 -Ofast
LTO
FDO
FDO+LTO
SPECint 2006 code size
GCC 7 -O2 GCC 8 -O2 GCC 7 -Ofast GCC 8 -OfastGCC 8 -Ofast + FDO Clang 6 -O2 Clang 6 -Ofast Icc 18 -Ofast0
2000000
4000000
6000000
8000000
10000000
12000000
Non-LTO
LTO
SPECfp 2006 performance (relative to GCC 6)
GCC 7 -O
2
GCC 8 -O
2
GCC 6 -O
fast
GCC 7 -O
fast
GCC 8 -O
fast
ICC 1
8 -O
fast
GCC 8 -O
2 -fl
to
GCC 8 -O
fast
lto
ICC 1
8 -O
fast
-flto
-10
0
10
20
30
40
50
60
generic
native
Firefox performance summary
resp
onsiv
enes
s tp
5o
drom
aeo
dom
Display
list m
utat
e
tap
paint
spee
dom
eter
a11y
r
svgr
_opa
city
tp5o
ARES6
style
benc
hta
rt
drom
aeo
css
tsvg
x
star
tup
time
-5
0
5
10
15
20
25
30
35
Firefox performance relative to non-LTO build
static
Profile feedback
Firefox performance – dromaeo DOM
Firefox performance – tp5o page responsiveness
Firefox binary size
clang6 -Oz -fltoclang6 -Oz -flto=thin
clang6 -O2 -fltoclang6 -O3 -flto=thin + FDO
clang6 O3 -flto + FDOclang6 -O3 -flto
clang6 -O3 -flto=thin
clang6 -Ozclang6 -Osclang6 -O2
clang6 -O3 + FDOclang6 -O3
gcc8 -Os -fltogcc8 -O3 -flto + FDO
gcc8 -O2 -fltogcc8 -O3 -flto
gcc8 -Osgcc8 -O3 + FDO
gcc8 -O2gcc8 -O3
gcc8 -O3 -flto + FDOgcc7 -O3 -flto + FDOgcc6 -O3 -flto + FDO
0 20000000 40000000 60000000 80000000 100000000 120000000 140000000
EH
data
relocations
text
Firefox binary size
clang6 -Oz -fltoclang6 -Oz -flto=thin
clang6 -O2 -fltoclang6 -O3 -flto=thin + FDO
clang6 O3 -flto + FDOclang6 -O3 -flto
clang6 -O3 -flto=thin
clang6 -Ozclang6 -Osclang6 -O2
clang6 -O3 + FDOclang6 -O3
gcc8 -Os -fltogcc8 -O3 -flto + FDO
gcc8 -O2 -fltogcc8 -O3 -flto
gcc8 -Osgcc8 -O3 + FDO
gcc8 -O2gcc8 -O3
gcc8 -O3 -flto + FDOgcc7 -O3 -flto + FDOgcc6 -O3 -flto + FDO
0 20000000 40000000 60000000 80000000 100000000 120000000 140000000
EH
data
relocations
text
Firefox binary size
clang6 -Oz -fltoclang6 -Oz -flto=thin
clang6 -O2 -fltoclang6 -O3 -flto=thin + FDO
clang6 O3 -flto + FDOclang6 -O3 -flto
clang6 -O3 -flto=thin
clang6 -Ozclang6 -Osclang6 -O2
clang6 -O3 + FDOclang6 -O3
gcc8 -Os -fltogcc8 -O3 -flto + FDO
gcc8 -O2 -fltogcc8 -O3 -flto
gcc8 -Osgcc8 -O3 + FDO
gcc8 -O2gcc8 -O3
gcc8 -O3 -flto + FDOgcc7 -O3 -flto + FDOgcc6 -O3 -flto + FDO
0 20000000 40000000 60000000 80000000 100000000 120000000 140000000
EH
data
relocations
text
Firefox binary size
clang6 -Oz -fltoclang6 -Oz -flto=thin
clang6 -O2 -fltoclang6 -O3 -flto=thin + FDO
clang6 O3 -flto + FDOclang6 -O3 -flto
clang6 -O3 -flto=thin
clang6 -Ozclang6 -Osclang6 -O2
clang6 -O3 + FDOclang6 -O3
gcc8 -Os -fltogcc8 -O3 -flto + FDO
gcc8 -O2 -fltogcc8 -O3 -flto
gcc8 -Osgcc8 -O3 + FDO
gcc8 -O2gcc8 -O3
gcc8 -O3 -flto + FDOgcc7 -O3 -flto + FDOgcc6 -O3 -flto + FDO
0 20000000 40000000 60000000 80000000 100000000 120000000 140000000
EH
data
relocations
text
Firefox binary size
clang6 -Oz -fltoclang6 -Oz -flto=thin
clang6 -O2 -fltoclang6 -O3 -flto=thin + FDO
clang6 O3 -flto + FDOclang6 -O3 -flto
clang6 -O3 -flto=thin
clang6 -Ozclang6 -Osclang6 -O2
clang6 -O3 + FDOclang6 -O3
gcc8 -Os -fltogcc8 -O3 -flto + FDO
gcc8 -O2 -fltogcc8 -O3 -flto
gcc8 -Osgcc8 -O3 + FDO
gcc8 -O2gcc8 -O3
gcc8 -O3 -flto + FDOgcc7 -O3 -flto + FDOgcc6 -O3 -flto + FDO
0 20000000 40000000 60000000 80000000 100000000 120000000 140000000
EH
data
relocations
text
Summary
● LTO now works and scale for large applications (Firefox, Libreoffice,…)
● Important size optimization especially for large C++ programs● Often important performance optimization especially when
combined with profile feedback● Space for future improvements
(both on GCC side and re-optimization of programs to LTO model)
LTO in openSUSE Factory
LTO in openSUSE Factory
● A new staging project (openSUSE:Factory:Staging:N)● Setting: Optflags: * -flto● ~80 packages failed (out of all ~2300)● Branch all failing packages and disable LTO● LTO Factory (w/ some disabled packages) can run openQA test-
suite (with a small fallout)
Package statistics
● Total ELF files in distribution: ~6700● Reduction: 1839 MB to 1736MB (-5.6%)● libmergedlo.so: reduction 10MB (-16%) ● Biggest reduction:
– mysql_upgrade (-92%)
– Innochecksum (-94%)
– mbstream (-94%)
Issues
● Argyllcms: lto1: fatal error: multiple prevailing defs for 'xcal_read_icc': LD PR: https://sourceware.org/PR23079
● GCC LTO miscompilation: http://gcc.gnu.org/PR85248● .symver in shared libraries: https://gcc.gnu.org/PR48200:
– __asm__(".symver old_foo,foo@VERS_1.1")
– New symver function attribute must be added (GCC 9.1.0)
Issues (cont.)
● static libraries (*.la):– e2fsprogs, btrfsprogs, …– Fat LTO objects must be used (-ffat-lto-objects)– LTO elf sections should be stripped– OBS sanitizer should be extended– LTO mode can combine both LTO objects and assembly objects
Issues (cont. 2)
● LTO warnings (-Wodr, ...):– ltrace: error: type of 'filter_matches_symbol' does not match
original declaration [-Werror=lto-type-mismatch]– gdb: error: type 'struct ipa_sym_addresses' violates the C++ One
Definition Rule [-Werror=odr]● mpir: weak configure script that scans *.o file as a blob● weak support for LTO debug info in dwz tool● Maybe higher memory constaints for selected packages
Issues (cont. 3)
● Usage of top-level assembler (https://gcc.gnu.org/PR57703):– Example of syscall.cc in Chromium project:
asm volatile(".text\n"
".align 16, 0x90\n"
".type SyscallAsm, @function\n"
"SyscallAsm:.cfi_startproc\n"
– Error:nacl_helper.ltrans1.ltrans.o: In function `playground2::SandboxSyscall(int, long, long, long, long, long, long)':
nacl_helper.ltrans1.o:(.text+0x4503): undefined reference to `SyscallAsm'
Histogram of text segment sizes
0 5,000,000 10,000,000 15,000,000 20,000,000 25,000,000 30,000,000 35,000,000 40,000,0000
20
40
60
80
100
120
140
Conclusion
● LTO looks mature enough to be used by default● Do we want it for openSUSE Factory?● For the future, can we offer it also for SLE?
LicenseThis slide deck is licensed under the Creative Commons Attribution-ShareAlike 4.0 International license. It can be shared and adapted for any purpose (even commercially) as long as Attribution is given and any derivative work is distributed under the same license.
Details can be found at https://creativecommons.org/licenses/by-sa/4.0/
General DisclaimerThis document is not to be construed as a promise by any participating organisation to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. openSUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for openSUSE products remains at the sole discretion of openSUSE. Further, openSUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All openSUSE marks referenced in this presentation are trademarks or registered trademarks of SUSE LLC, in the United States and other countries. All third-party trademarks are the property of their respective owners.
Credits
TemplateRichard Brown
rbrown@opensuse.org
Design & InspirationopenSUSE Design Team
http://opensuse.github.io/branding-guidelines/