Slides: hubicka/slides/opensuse2018-e.pdf · 2018. 5. 27.

Jan Hubička and Martin Liška, SUSE Labs

jh@suse.cz, mliska@suse.cz

Building openSUSE with link-time optimizations

Outline

● What is link-time optimization?
● Link-time optimization and GCC
● Benchmarks
● Can we build openSUSE with link-time optimization by default?

What is link-time optimization?

Per-file compilation

[Diagram: File1.c … File4.c are each compiled by GCC into File1.o … File4.o; ld / gold links the object files into a.out.]

Link-time optimization

[Diagram: File1.c … File4.c are each compiled by GCC into File1.o … File4.o, which carry IL; ld / gold hands them through the LTO plug-in to the link-time compiler and produces a.out.]

Benefits of LTO

● Symbol promotion (from the linker's resolution data most symbols become "static")
● Cross-module inlining, constant propagation
● Aggressive unreachable code removal
● Profile propagation
● EH optimization (propagation of "nothrow")
● Identical code folding
● Optimized code layout
● and more :)
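As a rough illustration of symbol promotion and cross-module inlining, the sketch below puts what would normally be two translation units into one listing. The names square and sum_of_squares, and the file names in the comments, are hypothetical and not from the slides. With per-file compilation, the compiler building the second file sees only the declaration of square and has to emit a call. With -flto, the link-time compiler sees both bodies; if the linker reports that nothing else references square, it may treat it as static and inline it.

```c
/* Illustrative only: two "files" shown in one listing so it compiles as-is.
 * In a real project, square() would live in util.c and sum_of_squares()
 * in sum.c; file names and function names are hypothetical. */

/* --- util.c --- */
int square(int x)
{
    return x * x;
}

/* --- sum.c --- */
int square(int x);          /* all that per-file compilation would see */

int sum_of_squares(int n)
{
    int total = 0;
    for (int i = 1; i <= n; i++)
        total += square(i); /* candidate for cross-module inlining */
    return total;
}
```

Split into separate files, something like gcc -O2 -flto -c util.c sum.c followed by a link step that also passes -O2 -flto would give the link-time compiler the chance to do this. Whether it actually inlines depends on GCC's heuristics.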

Problems of LTO

● Whole toolchain has to be restructured
● Slow compile-edit cycle
● Harder bug reports (often the whole program is needed to reproduce an issue)
● Not 100% transparent to the user, but in most cases all one needs to do is add -flto

Link-time optimization and GCC

Modernizing GCC (LTO perspective)

● 1999 (GCC 2.95): Function-at-a-time
● 2001 (GCC 3.0): New inliner (first high-level opt. in GCC)
● 2004 (GCC 3.4): Unit-at-a-time; intermodule compilation for C; inter-procedural optimization framework
● 2005 (GCC 4.0): New SSA optimization framework
● 2006 (GCC 4.1): Inter-procedural optimizations: profile guided inlining, pure/const discovery, mod/ref, inter-procedural constant propagation, -fwhole-program --combine
● 2008 (GCC 4.4): Inter-procedural optimization on SSA; early optimization and inlining
● 2010 (GCC 4.5): Basic LTO framework (5 years in development)
● 2011 (GCC 4.6): WHOPR (parallel link-time optimization); Firefox builds

Parallelized Link-time optimization (WHOPR)

[Diagram: File1.c … File4.c are each compiled by GCC into File1.o … File4.o, which carry IL; ld / gold hands them to the LTO plugin, which runs one Whole Program Analysis step followed by several "Local opt." steps in parallel, and the result is linked into a.out.]

Modernizing GCC (LTO perspective, cont.)

● 2012 (GCC 4.7.0): Memory use optimizations, new inliner heuristics, new inter-procedural constant propagation with cloning
● 2013 (GCC 4.8.0): Symbol table; propagation of values passed through aggregates
● 2014 (GCC 4.9.0): Slim LTO objects by default; on-demand loading of functions; devirtualization pass; feedback-directed code layout
● 2015 (GCC 5): Identical Code Folding; COMDAT optimization; One Definition Rule for C++; alignment propagation; correct command line options handling with LTO
● 2016 (GCC 6): Linker plugin now detects type of output binary. C and Fortran type merging. Better alias analysis
● 2017 (GCC 7): Inter-procedural value range propagation; bitwise propagation
● 2018 (GCC 8): Early debug info. Profile representation rewrite. Function splitting now on by default. Reworked runtime estimation; malloc attribute propagation

GCC optimization pipeline

Compile time

● Parser
● IL generation
● Early opts: Early inliner; Constant prop.; Forward prop.; Jump threading; Scalar repl. of aggr.; Alias analysis; Redundancy elim.; Dead store elim.; Dead code elim.; Tail recursion; Switch conversion; pure/const/nothrow; EH optimization; Profile guessing
● IP analysis, streaming out

Link-time serial

● Symbol & type streaming in + merging
● Inter-procedural (whole program) opts: Dead symbol elim.; Symbol promotion; profile analysis; Identical code folding; devirtualization; Constant propagation; const/destr merging; Inlining; pure/const/nothrow; mod/ref; comdat
● Partitioning, streaming out

Link-time parallel

● Streaming in symbols, types and declarations & link
● Stream in and apply transformations
● High level opts: Constant prop.; Complete unroll; Forward prop.; Alias analysis; Return slot opt.; Redundancy elim.; Jump threading; Dead code elim.; Conditional store elim.; Copy prop.; If combine; Tail recursion; Copy loop headers; Scalar repl. of aggr.; Dead store elim.; Dead code elim.; Reassociation; Sincos, bswap opt.; Loop invariant motion; Partial redundancy elim.; Loop splitting; Unroll and jam; Loop distribution; Loop interchange; ...
● Low level opts: Common subexpression elim.; Forward propagation; Copy propagation; Partial redundancy elim.; Code hoisting; Copy propagation; Store motion; If conversion; Loop invariant motion; Loop unrolling; Doloop optimization; Web construction; Copy propagation; Common subexpression elim.; Dead store elim.; Instruction combine; Function partitioning; Instruction splitting; Live range shrinking; Scheduling; Register allocation; Global common subexpr. elim.; Shrink wrapping; Stack adjustment opt.; Register renaming; Constant prop.; Code reordering; Scheduling; X87 register stack; Code/data alignment; Machine dependent reorg.; Code output

Benchmarks

SPECint 2006 performance (relative to GCC 6)

[Bar chart, built up over several slides; series: generic, native. Configurations shown: GCC 7 -O2; GCC 8 -O2; GCC 6 -Ofast; GCC 7 -Ofast; GCC 8 -Ofast; clang/flang 6 -Ofast; ICC 18 -Ofast; GCC 8 -O2 -flto; GCC 8 -Ofast -flto; clang/flang 6 -Ofast -flto; ICC 18 -Ofast -flto.]

Per-benchmark breakdowns (400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 456.hmmer, 458.sjeng, 462.libquantum, 464.h264ref, 471.omnetpp, 473.astar, 483.xalancbmk, Geomean):

[Chart: GCC -Ofast relative to GCC 6 -O2; series: GCC 6, GCC 7, GCC 8.]

[Chart: Clang & ICC -Ofast relative to GCC 8; series: clang/flang 6, ICC 18.]

[Chart: GCC -Ofast -flto relative to -Ofast.]

[Chart: Clang/flang 6 and ICC 18 relative to GCC 8 (-O2 -flto); series: clang/flang 6, ICC 18.]

Profile feedback & LTO work well together

The hmmer benchmark has problems and was excluded.

To build with profile feedback, you can often use: ./configure ; CFLAGS="-O2 -fprofile-generate" make ; make check ; make clean ; CFLAGS="-O2 -fprofile-use" make
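A small, hypothetical example of the kind of code where profile feedback can help: whether the rejection branch is rare is something the compiler can only guess statically, but can measure from a training run. The function count_valid and the assumption that negative values are uncommon are made up for illustration and are not from the slides.

```c
#include <stddef.h>

/* Hypothetical hot loop: negative entries are assumed to be rare in
 * practice, but the compiler cannot know that without a profile. */
long count_valid(const int *values, size_t n, long *rejected)
{
    long valid = 0;
    *rejected = 0;
    for (size_t i = 0; i < n; i++) {
        if (values[i] < 0)
            (*rejected)++;   /* cold path in a typical training run */
        else
            valid++;         /* hot path */
    }
    return valid;
}
```

Compiled once with -fprofile-generate, exercised with representative input, and recompiled with -fprofile-use, the measured branch counts become available to the optimizer. How much that changes the generated code depends on the compiler version and the program.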

[Chart: Performance relative to GCC 8 -Ofast, per benchmark (400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 458.sjeng, 462.libquantum, 464.h264ref, 471.omnetpp, 473.astar, 483.xalancbmk, Geomean); series: LTO, FDO, FDO+LTO.]

SPECint 2006 code size

[Bar chart: code size for GCC 7 -O2, GCC 8 -O2, GCC 7 -Ofast, GCC 8 -Ofast, GCC 8 -Ofast + FDO, Clang 6 -O2, Clang 6 -Ofast, ICC 18 -Ofast; series: Non-LTO, LTO.]

SPECfp 2006 performance (relative to GCC 6)

[Bar chart; series: generic, native. Configurations shown: GCC 7 -O2; GCC 8 -O2; GCC 6 -Ofast; GCC 7 -Ofast; GCC 8 -Ofast; ICC 18 -Ofast; GCC 8 -O2 -flto; GCC 8 -Ofast -flto; ICC 18 -Ofast -flto.]

Firefox performance summary

[Chart: Firefox performance relative to non-LTO build; series: static, Profile feedback. Benchmarks: responsiveness tp5o, dromaeo dom, Displaylist mutate, tappaint, speedometer, a11yr, svgr_opacity, tp5o, ARES6, stylebench, tart, dromaeo css, tsvgx, startup time.]

Firefox performance – dromaeo DOM

Firefox performance – tp5o page responsiveness

Firefox binary size

[Stacked bar chart, repeated over several slides; segments: EH, data, relocations, text; horizontal axis from 0 to 140000000. Configurations shown: clang6 -Oz -flto; clang6 -Oz -flto=thin; clang6 -O2 -flto; clang6 -O3 -flto=thin + FDO; clang6 -O3 -flto + FDO; clang6 -O3 -flto; clang6 -O3 -flto=thin; clang6 -Oz; clang6 -Os; clang6 -O2; clang6 -O3 + FDO; clang6 -O3; gcc8 -Os -flto; gcc8 -O3 -flto + FDO; gcc8 -O2 -flto; gcc8 -O3 -flto; gcc8 -Os; gcc8 -O3 + FDO; gcc8 -O2; gcc8 -O3; gcc7 -O3 -flto + FDO; gcc6 -O3 -flto + FDO.]

Summary

● LTO now works and scales for large applications (Firefox, LibreOffice, …)
● Important size optimization, especially for large C++ programs
● Often an important performance optimization, especially when combined with profile feedback
● Space for future improvements (both on the GCC side and in re-optimizing programs for the LTO model)

LTO in openSUSE Factory

● A new staging project (openSUSE:Factory:Staging:N)
● Setting: Optflags: * -flto
● ~80 packages failed (out of ~2300 in total)
● Branch all failing packages and disable LTO
● LTO Factory (with some packages disabled) can run the openQA test suite (with a small fallout)

Package statistics

● Total ELF files in distribution: ~6700
● Reduction: 1839 MB to 1736 MB (-5.6%)
● libmergedlo.so: reduction 10 MB (-16%)
● Biggest reduction:
– mysql_upgrade (-92%)
– innochecksum (-94%)
– mbstream (-94%)
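The overall figure can be checked with a one-line calculation. The helper below is a hypothetical sketch (the name size_change_percent is not from the slides); for 1839 MB before and 1736 MB after it gives roughly -5.6%, matching the number quoted above.

```c
/* Relative size change in percent; negative means the result is smaller.
 * Hypothetical helper, used only to check the arithmetic on this slide. */
double size_change_percent(double before, double after)
{
    return (after - before) / before * 100.0;
}
```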

Issues

● Argyllcms: lto1: fatal error: multiple prevailing defs for 'xcal_read_icc'; LD PR: https://sourceware.org/PR23079
● GCC LTO miscompilation: http://gcc.gnu.org/PR85248
● .symver in shared libraries (https://gcc.gnu.org/PR48200):
– __asm__(".symver old_foo,foo@VERS_1.1")
– New symver function attribute must be added (GCC 9.1.0)

Issues (cont.)

● static libraries (*.la):
– e2fsprogs, btrfsprogs, …
– Fat LTO objects must be used (-ffat-lto-objects)
– LTO ELF sections should be stripped
– OBS sanitizer should be extended
– LTO mode can combine both LTO objects and assembly objects

Issues (cont. 2)

● LTO warnings (-Wodr, ...):
– ltrace: error: type of 'filter_matches_symbol' does not match original declaration [-Werror=lto-type-mismatch]
– gdb: error: type 'struct ipa_sym_addresses' violates the C++ One Definition Rule [-Werror=odr]
● mpir: weak configure script that scans a *.o file as a blob
● weak support for LTO debug info in the dwz tool
● Maybe higher memory constraints for selected packages

Issues (cont. 3)

● Usage of top-level assembler (https://gcc.gnu.org/PR57703):
– Example of syscall.cc in the Chromium project:
    asm volatile(".text\n"
                 ".align 16, 0x90\n"
                 ".type SyscallAsm, @function\n"
                 "SyscallAsm:.cfi_startproc\n"
– Error:
    nacl_helper.ltrans1.ltrans.o: In function `playground2::SandboxSyscall(int, long, long, long, long, long, long)':
    nacl_helper.ltrans1.o:(.text+0x4503): undefined reference to `SyscallAsm'

Histogram of text segment sizes

[Histogram: horizontal axis from 0 to 40,000,000; vertical axis from 0 to 140.]

Conclusion

● LTO looks mature enough to be used by default
● Do we want it for openSUSE Factory?
● For the future, can we offer it also for SLE?

License

This slide deck is licensed under the Creative Commons Attribution-ShareAlike 4.0 International license. It can be shared and adapted for any purpose (even commercially) as long as Attribution is given and any derivative work is distributed under the same license.

Details can be found at https://creativecommons.org/licenses/by-sa/4.0/

General Disclaimer

This document is not to be construed as a promise by any participating organisation to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. openSUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for openSUSE products remains at the sole discretion of openSUSE. Further, openSUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All openSUSE marks referenced in this presentation are trademarks or registered trademarks of SUSE LLC, in the United States and other countries. All third-party trademarks are the property of their respective owners.

Credits

Template: Richard Brown, rbrown@opensuse.org

Design & Inspiration: openSUSE Design Team, http://opensuse.github.io/branding-guidelines/