Code Compaction for UniCore on Link-Time Optimization Platform
Zhang Jiyu, Compilation Toolchain Group, MPRC
Date posted: 19-Dec-2015


Page 1: Code Compaction for UniCore on Link-Time Optimization Platform

Zhang Jiyu
Compilation Toolchain Group, MPRC

Page 2: Compilation Process

Slide footer: August, ICDFN 2006

1. Design: ideas
2. Source Code: *.cpp, *.c, *.h, Makefile, ...
3. Assembly Code: *.asm, *.s, linking scripts, ...
4. Object Files: *.o, *.a, *.so, ...
5. Executable: DLL, executable
6. Profile Data: execution frequency, traces, ...

Steps between stages: Coding, Compile, Assemble, Linking, Execute. Profile data feeds back through Profile Guided Optimization.

Page 3: Our Optimization Process

1. Design: ideas
2. Source Code: *.cpp, *.c, *.h, Makefile, ...
3. Assembly Code: *.asm, *.s, linking scripts, ...
4. Object Files: *.o, *.a, *.so, ...
5. Executable: DLL, executable
6. Profile Data: execution frequency, traces, ...

Steps between stages: Coding, Compile, Assemble, Linking & Link-Time Optimization, Execute. The diagram shows two Profile Guided Optimization feedback paths.

Page 4: CLOU is a Link-time Optimizer for UniCore

[Figure: CLOU processing pipeline. Input object files carry Code, Data, and Meta sections. Pipeline stages: Linking; Translation to IR; CFG construction & Optimizations; Layout; Assembling; Exec.]

A Graph Modified From Diablo

Page 5: Code Compaction based on CLOU

• Motivation of code compaction
  – Limited memory and energy resources for embedded systems
  – Code density affects both memory and energy consumption
• Goal: reducing code size without losing performance
• Code compaction at different levels
  1. Typical optimizations for code size reduction at link-time
  2. Hot/cold code splitting
  3. New mixed code generation method

[Figure: three numbered panels (1, 2, 3), one per level, each showing program code divided into hot code and cold code regions.]

Page 6: Typical Optimizations for Code Size Reduction

• Redundant code elimination
  – Computations whose results have been computed previously and are guaranteed to be available at that point
• Unreachable code elimination
  – Code fragments to which no control flow path exists from the entry node
  – Many of them follow useless comparisons
• Dead code elimination
  – Computations whose results are never used
• Peephole optimization
• Procedural abstraction (might lead to performance loss)
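As an illustration of the unreachable code elimination bullet, here is a minimal sketch in Python. It is not CLOU's implementation: the CFG representation (a dictionary from block name to successor names), the function name `remove_unreachable`, and the block names are all assumptions made for the example.

```python
from collections import deque

def remove_unreachable(cfg, entry):
    """Return a copy of cfg that keeps only blocks reachable from entry.

    cfg maps a block name to the list of its successor block names.
    """
    reachable = {entry}
    worklist = deque([entry])
    while worklist:
        block = worklist.popleft()
        for succ in cfg.get(block, []):
            if succ not in reachable:
                reachable.add(succ)
                worklist.append(succ)
    return {b: list(s) for b, s in cfg.items() if b in reachable}

# Hypothetical CFG: "orphan" used to be the target of a comparison that
# always goes the other way, so no edge leads to it any more.
cfg = {
    "entry": ["check"],
    "check": ["body"],
    "body": ["exit"],
    "orphan": ["exit"],
    "exit": [],
}
pruned = remove_unreachable(cfg, "entry")
```

The block `orphan` is dropped because the traversal from `entry` never visits it; every other block is kept with its successor list unchanged.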

Page 7: Experiments for Typical Optimizations for Code Size Reduction

• Benchmark: Mediabench
• Code size reduction
  – Average: 12.8%
  – Max: 22.3%
• Performance improvement
  – Average: 2.4%
  – Max: 4.2%

Page 8: Hot/Cold Code Splitting

• Less code transferred from remote to local, from disk to memory, or from memory to cache
  – Question: might this be too conservative, or lead to performance loss?
• Hot and cold code are split through basic block reordering

[Figure: three numbered panels (1, 2, 3) showing alternative layouts of the same blocks: Code, Condition, Hot Code, Cold Code, More Code.]

Page 9: Hot/Cold Code Splitting

• PH: a popular greedy approach
• Structural Analysis Based Basic Block Reordering
  – Most of a program can be decomposed into several typical structures
  – A cost model for each structure
  – Minimal-cost layout: the optimal layout for each local structure, based on profiling information

[Figure: typical control flow structures over basic blocks B1 ... Bn, with edge frequencies labelled x, y, z. Panels: (a) Block model; (b) If-then; (c) If-then-else; (d) While-loop; (e) Repeat-loop; (f) Natural-loop; (g) Natural-loop.]
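The slide names PH only as "a popular greedy approach". Assuming this refers to the Pettis-Hansen style of bottom-up block positioning, the idea can be sketched as follows: start with one chain per block, visit edges from most to least frequent, and join two chains whenever the edge runs from the tail of one to the head of another. The function name `ph_layout`, the data layout, and the frequencies are hypothetical, and the rule for ordering the finished chains is a simplification.

```python
def ph_layout(blocks, edges):
    """Greedy bottom-up chain merging.

    blocks: block names in original order; blocks[0] is the entry.
    edges:  dict mapping (src, dst) to an execution frequency.
    """
    chain_of = {b: [b] for b in blocks}  # block -> the chain (list) holding it
    by_weight = sorted(edges.items(), key=lambda kv: -kv[1])
    for (src, dst), _freq in by_weight:
        head_chain, tail_chain = chain_of[src], chain_of[dst]
        if head_chain is tail_chain:
            continue  # already in the same chain
        if dst == blocks[0]:
            continue  # keep the entry block at the front of its chain
        if head_chain[-1] != src or tail_chain[0] != dst:
            continue  # edge does not connect a chain tail to a chain head
        head_chain.extend(tail_chain)
        for blk in tail_chain:
            chain_of[blk] = head_chain
    # Simplified final ordering: emit chains in order of first appearance.
    layout, seen = [], set()
    for blk in blocks:
        chain = chain_of[blk]
        if id(chain) not in seen:
            seen.add(id(chain))
            layout.extend(chain)
    return layout

# Hypothetical if-then-else: A branches to cold B or hot C, both reach D.
blocks = ["A", "B", "C", "D"]
edges = {("A", "B"): 5, ("A", "C"): 95, ("B", "D"): 5, ("C", "D"): 95}
layout = ph_layout(blocks, edges)
```

With these frequencies the hot path A, C, D ends up contiguous and the cold block B is pushed to the end, which is the hot/cold separation the slide describes.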

Page 10: Basic Block Reordering

• Cost model
  – Different kinds of control flow edges have different costs
  – For a specific order, the total cost is the frequency-weighted sum over all control flow edges:

    $$\mathrm{Cost} = \sum_{e \,\in\, \text{control flow edges}} \mathrm{Cost}(e) \times \mathrm{Frequency}(e)$$

  – A list can be obtained for each structure: f(structure, frequencies of all edges) → the best order of basic blocks for the local structure

[Figure: four panels showing how an edge between blocks A, B, and C is realized in a layout. (a) the blocks are adjacent and no branch is needed; (b) an unconditional branch "b L1" to label L1; (c) a conditional branch "cmp …; beq L1" with one fall-through successor; (d) a conditional branch followed by an unconditional branch, "cmp …; beq L1; b L2".]
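The cost model above can be sketched for a single small structure. The per-edge costs below (0 for a fall-through edge, 2 for a taken branch) are assumed values, since the slides do not state them, and the exhaustive search is only an illustration for a handful of blocks, not the O(N log N) procedure reported in the experiments. The sketch also ignores the extra unconditional branch a conditional needs when neither successor is adjacent.

```python
from itertools import permutations

# Hypothetical per-edge costs; the slides do not give the real values.
FALL_THROUGH = 0
TAKEN_BRANCH = 2

def layout_cost(order, edges):
    """Sum Cost(e) * Frequency(e) over all control flow edges for one order."""
    pos = {block: i for i, block in enumerate(order)}
    total = 0
    for (src, dst), freq in edges.items():
        adjacent = pos[dst] == pos[src] + 1
        total += (FALL_THROUGH if adjacent else TAKEN_BRANCH) * freq
    return total

def best_order(blocks, edges):
    """Try every order that keeps the structure's entry block first."""
    entry, rest = blocks[0], blocks[1:]
    candidates = ((entry,) + p for p in permutations(rest))
    return min(candidates, key=lambda order: layout_cost(order, edges))

# Hypothetical if-then-else: B1 branches to B2 (cold) or B3 (hot), both join at B4.
blocks = ["B1", "B2", "B3", "B4"]
edges = {("B1", "B2"): 10, ("B1", "B3"): 90, ("B2", "B4"): 10, ("B3", "B4"): 90}
order = best_order(blocks, edges)
cost = layout_cost(order, edges)
```

For these frequencies the cheapest order places the hot blocks B1, B3, B4 consecutively and moves the cold block B2 to the end, so only the two low-frequency edges pay the branch cost.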

Page 11: Experiments

• Complexity: O(N log N), where N is the number of basic blocks
• Experiment results (not using other link-time optimizations)

[Figure: two bar charts comparing ORIG, PH, and SABO. Chart 1: normalized cycle counts (overall performance), y-axis from 0.75 to 1.05. Chart 2: normalized instruction cache miss rate, y-axis from 0.5 to 1.6. Benchmarks on the x-axis: adpcm-encode, epic, jpeg-encode, pegwit-encode, mipmap, osdemo.]

Page 12: Mixed Code Generation

• Dual-width instruction set
  – 32-bit ISA: more powerful
  – 16-bit ISA: more compact
    • Less coding space for operations
    • Smaller register fields
    • Smaller immediate fields

Example: one 32-bit instruction and the 16-bit sequence shown for the same operation

  32-bit:
    add r0, r0, 0xff800000

  16-bit:
    str r2, [addr]
    mov r2, 0xff
    lsl r2, #1
    add r2, #1
    lsl r2, 24
    add r0, r2
    ld r2, [addr]

Page 13: Mixed Code Generation

• Related work in dual-width instruction set design and mixed code generation
  – Coarse-grained function-level mixed code generation
    • By BX in ARM and JALX in MIPS
  – Simple fine-grained instruction-level mixed code generation
    • By BX in ARM and JALX in MIPS
    • By a single specific mode-changing instruction
  – Specialized coding
    • A one-leading instruction word indicates one 32-bit instruction; a zero-leading instruction word indicates two 16-bit instructions
    • 16-bit ISA extensions
• Problem: these approaches always lead to performance loss

Page 14: Potential benefit

• Analysis of programs in Mediabench

27851 different instructions in all programs: log2(27851) ≈ 15

Rank   Unicore32 instruction   Average percentage
1      mov                     23%
2      ldr                     16%
3      cmp                     8%
4      add                     8%
5      str                     6%
6      b                       5%
Total                          66%
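The two figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes the logarithm is base 2 and rounded up, which is the reading under which 27851 distinct instructions need 15 bits to be numbered; the variable names are only for illustration.

```python
import math

# Number of distinct instructions reported for the Mediabench programs.
distinct_instructions = 27851

# Bits needed to give each distinct instruction its own index.
bits_needed = math.ceil(math.log2(distinct_instructions))

# Average share (in percent) of the six most frequent instructions.
top_six = {"mov": 23, "ldr": 16, "cmp": 8, "add": 8, "str": 6, "b": 5}
coverage = sum(top_six.values())
```

Since 2^14 = 16384 is too small and 2^15 = 32768 is large enough, `bits_needed` comes out as 15, and the six table entries add up to the 66% total shown on the slide.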

Page 15: Two Main Kinds of Frequent Instructions

• Two-operand instructions
  – mov rd, rm or short immediate
  – cmp rn, rm or short immediate
• Branch/Jump
  – Distribution of immediate offsets of branch instructions

[Figure: histogram of branch immediate offsets. x-axis: number of bits needed (1 to 17); y-axis: percentage (0 to 0.2).]
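A distribution like the one in the figure could be gathered as sketched below. The sketch assumes branch offsets are signed two's-complement values, which the slide does not state, and the sample offsets are made up for illustration rather than taken from Mediabench.

```python
from collections import Counter

def signed_bits_needed(offset):
    """Smallest two's-complement width that can represent offset."""
    if offset >= 0:
        return offset.bit_length() + 1
    return (~offset).bit_length() + 1

def width_histogram(offsets):
    """Map each bit width to the fraction of offsets that need exactly it."""
    counts = Counter(signed_bits_needed(o) for o in offsets)
    return {width: n / len(offsets) for width, n in sorted(counts.items())}

def fraction_fitting(offsets, width):
    """Fraction of offsets that fit in an immediate field of the given width."""
    fits = sum(1 for o in offsets if signed_bits_needed(o) <= width)
    return fits / len(offsets)

# Hypothetical branch offsets, for illustration only.
sample_offsets = [4, -8, 12, 300, -2]
histogram = width_histogram(sample_offsets)
short_share = fraction_fitting(sample_offsets, 8)
```

Here four of the five sample offsets fit in an 8-bit field; the same calculation over real branch offsets would show how many branches a short 16-bit encoding could cover.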

Page 16: The Idea of Mode-Changing Instruction Set (MC)

• Extend the 32-bit ISA with a small MC instruction set (using the reserved coding space). Each MC instruction:
  – Changes the CPU mode
  – Performs its own normal operation
• Scan for suitable 32-bit instructions to be encoded as 16-bit instructions
• A mixed code fragment with MC instructions:

  32-bit instructions
  MC instruction        | UniCore16 instruction
  UniCore16 instruction | UniCore16 instruction
  …                     | …
  UniCore16 instruction | UniCore16 instruction
  MC instruction        | UniCore16 instruction
  32-bit instructions
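The scanning step can be sketched as a pass that finds runs of compressible instructions and brackets each run with MC instructions. This is a guess at the general shape, not the author's algorithm: the function `mark_mixed`, the `min_run` threshold, the labels, and the assumption that every compressible instruction also has an MC form are all hypothetical.

```python
def mark_mixed(instrs, is_compressible, min_run=4):
    """Label each instruction as 'U32', 'MC', or 'U16'.

    A run of at least min_run consecutive compressible instructions is
    entered and left through MC instructions; everything between them
    is emitted as 16-bit code. Shorter runs stay in 32-bit form.
    """
    labels = ["U32"] * len(instrs)
    i = 0
    while i < len(instrs):
        if not is_compressible(instrs[i]):
            i += 1
            continue
        j = i
        while j < len(instrs) and is_compressible(instrs[j]):
            j += 1
        if j - i >= min_run:
            labels[i] = "MC"        # switches into 16-bit mode
            labels[j - 1] = "MC"    # switches back to 32-bit mode
            for k in range(i + 1, j - 1):
                labels[k] = "U16"
        i = j
    return labels

# Hypothetical instruction stream and compressibility test.
short_forms = {"mov", "cmp", "add"}
stream = ["ldr_wide", "mov", "cmp", "add", "mov", "call_far"]
labels = mark_mixed(stream, lambda op: op in short_forms)
```

For the sample stream the result is one 32-bit instruction, an MC instruction, two 16-bit instructions, a second MC instruction, and a final 32-bit instruction. The threshold exists because a very short run would gain little from switching modes.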

Page 17: Modification to Micro Architecture

• Mixed code execution in Unicore-I pipeline

[Figure: pipeline diagram (stages IF, DEC, EXE, MEM, WB) for the sequence Inst 1 (BX, UniCore32), Inst 2 (UniCore16), Inst 3 (UniCore16), Inst 4 (UniCore16), Inst 5 (UniCore16).]

• Improved mixed code execution in Unicore-I pipeline

[Figure: pipeline diagram (stages IF, DEC, EXE, MEM, WB) for the sequence Inst 1 (UniCore32), Inst 2 (MC), Inst 3 (UniCore16), Inst 4 (UniCore16), Inst 5 (MC), Inst 6 (UniCore32).]

Annotations on the slide:
• No extra cycles
• One more 16-bit instruction-fetch buffer
• An MC-decoder

Page 18: Mixed Code Generation

[Figure: tool flow diagram. Components shown: programs, Instruction Analyzer, Mode-Changing Instructions, Link-Time Optimizer, mixed coded program, Simulator.]

Page 19: Experiment Results

• Normalized code size (results not using other link-time optimizations)

[Figure: bar chart of normalized code size, y-axis from 0 to 1.2. Series: UniCore32, UniCore16, Mixed.]

Page 20: Conclusion

• Code compaction on Link-Time Optimization Platform
  – Compiler optimizations applied at link time
    • Typical optimizations for code size reduction
  – Program layout optimization
    • Hot/cold code splitting through basic block reordering
  – Machine code generation
    • Mixed code generation
• Experiment results
  – Average code size reduction: 32.9%
  – Average performance improvement: 9.1%

Page 21: Thank you


Page 23: Instruction Analysis

Instruction format type classifications:
• 3 regs, all in r0-r7 / r8-r15 / r16-r23 / r24-r31
• 2 regs, one in r0-r31, one in r0-r16 / r17-r31
• 1 reg and 1 imme, imme field: 4-6 bits
• 1 imme, imme field: 9 bits

reg: short for register
imme: short for immediate field

Page 24: Experiment Results

• Normalized dynamic instruction numbers

[Figure: bar chart, y-axis from 0 to 6. Series: UniCore32, UniCore16, Mixed. Benchmarks: adpcm-encode, adpcm-decode, epic, unepic, pegwit-encode, pegwit-decode, jpeg-encode, jpeg-decode, mpeg2-encode, mpeg2-decode, mesa-mipmap, mesa-texgen, mesa-osdemo.]

• Normalized cycle counts

[Figure: bar chart, y-axis from 0 to 5. Series: UniCore32, UniCore16, Mixed. Benchmarks: adpcm-encode, adpcm-decode, epic, unepic, pegwit-encode, pegwit-decode, jpeg-encode, jpeg-decode, mpeg2-encode, mpeg2-decode, mesa-mipmap, mesa-texgen, mesa-osdemo.]