Compilers for embedded systems: Why are compilers an issue?


- 1 -EE898- Compiler


• Many reports about low efficiency of standard compilers
• Special features of embedded processors have to be exploited.
• High levels of optimization are more important than compilation speed.
• Compilers can help to reduce the energy consumption (energy optimization).
• Compilers could help to meet real-time constraints.
• Fewer legacy problems than for PCs.
• There is a large variety of instruction sets.
• Design space exploration for optimized processors makes sense.


Use of assembly languages in embedded systems

[Figure: chart comparing the use of Assembler (DSP), Assembler (µController), C-Code (µController) and C-Code (DSP)]

[Paulin, 1995]

Similar situation more recently


Optimizations considered

• Energy-aware compilation
• Compilation for digital signal processors
• Compilation for multimedia processors
• Compilation for VLIW (very long instruction word) processors
• Compilation for network processors
• Compiler generation, retargetable compilers and design space exploration


Efforts for Reducing Energy

• Device Level
  – Development of Low Power Devices
  – Reducing Power Supply Voltage
  – Reducing Threshold Voltage
• Circuit Level
  – Gated Clock
  – Pass Transistor Logic
  – Asynchronous Circuits
• System Level
• Fabrication level


Optimization for low-energy the same as optimization for high performance?

int a[1000];
c = a;
for (i = 1; i < 100; i++) {
    b += *c;
    b += *(c+7);
    c += 1;
}

Version 1 (two loads per iteration): 2096 cycles, 19.92 µJ

    LDR r3, [r2, #0]
    ADD r3,r0,r3
    MOV r0,#28
    LDR r0, [r2, r0]
    ADD r0,r3,r0
    ADD r2,r2,#4
    ADD r1,r1,#1
    CMP r1,#100
    BLT LL3

Version 2 (values kept in registers, one load per iteration): 2231 cycles, 16.47 µJ

    ADD r3,r0,r2
    MOV r0,#28
    MOV r2,r12
    MOV r12,r11
    MOV r11,r10
    MOV r10,r9
    MOV r9,r8
    MOV r8,r1
    LDR r1, [r4, r0]
    ADD r0,r3,r1
    ADD r4,r4,#4
    ADD r5,r5,#1
    CMP r5,#100
    BLT LL3

No!
• High-performance if available memory bandwidth fully used; low-energy consumption if memories are at stand-by mode
• Reduced energy if more values are kept in registers
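The two code versions on this slide can be mirrored at source level. The following is a minimal sketch, not the code the slide's compiler produced: the function names and the `window` array are assumptions made for illustration, and whether a given compiler actually keeps `window` in registers depends on the target and optimization level.

```c
#include <stddef.h>

/* Straightforward version: reads c[0] and c[7] from memory in every
 * iteration, i.e. two loads per iteration. */
int sum_two_loads(const int *a, size_t iterations)
{
    int b = 0;
    const int *c = a;
    for (size_t i = 0; i < iterations; i++) {
        b += *c;
        b += *(c + 7);
        c += 1;
    }
    return b;
}

/* Register-pipelined version: a value loaded as c[7] is kept and reused
 * seven iterations later as c[0], so only one load per iteration remains.
 * The array must hold at least iterations + 7 elements. */
int sum_one_load(const int *a, size_t iterations)
{
    int b = 0;
    int window[7];                     /* holds a[i] .. a[i+6] */
    for (size_t k = 0; k < 7; k++)
        window[k] = a[k];              /* one-time preload */

    for (size_t i = 0; i < iterations; i++) {
        int newest = a[i + 7];         /* the only load in the loop body */
        b += window[0];                /* same value as *c */
        b += newest;                   /* same value as *(c + 7) */
        for (size_t k = 0; k < 6; k++) /* shift the pipeline by one */
            window[k] = window[k + 1];
        window[6] = newest;
    }
    return b;
}
```

The second function executes more instructions per iteration (the shifts) but touches memory half as often, which is the trade-off the cycle and energy figures on the slide illustrate.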


Energy models

• Commercial tools frequently very imprecise
• Model of Tiwari (Dissertation, Princeton 1996): cost of instructions and of transitions between instructions; does not separate out the cost of memory access.
• Model of Simunic, de Micheli (DAC 99): model based on data sheets; does not require measurements. Does not take transitions into account.
• Russell, Jacome (ICCD, 1998): based on precise measurement for two fixed configurations; cannot predict effect of changes to memory architecture.
• Lee (LCTES 2001): detailed analysis of the effect of pipeline stages; does not include multi-cycle operations and stalls.

Dedicated energy models.


Energy Model by Knauer and Steinke

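[Figure: block diagram of an ARM7 processor (ALU, multiplier, barrel shifter, register file, instruction decoder & control logic; internal signals Imm, Reg#, RegValue, Opcode) supplied by VDD and connected to an instruction memory (IAddr, Instr) and a data memory (DAddr, Data)]

$$E_{total} = E_{cpu\_instr} + E_{cpu\_data} + E_{mem\_instr} + E_{mem\_data}$$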


Instruction dependent costs in the CPU

Cost for a sequence of m instructions:

$$
\begin{aligned}
E_{cpu\_instr} = \sum_{i=1}^{m} \Big( & \mathrm{MinCostCPU}(Opcode_i) \\
 & + \alpha_1 \sum_j w(Imm_{i,j}) + \beta_1 \sum_j h(Imm_{i-1,j}, Imm_{i,j}) \\
 & + \alpha_2 \sum_k w(Reg_{i,k}) + \beta_2 \sum_k h(Reg_{i-1,k}, Reg_{i,k}) \\
 & + \alpha_3 \sum_k w(RegVal_{i,k}) + \beta_3 \sum_k h(RegVal_{i-1,k}, RegVal_{i,k}) \\
 & + \alpha_4\, w(IAddr_i) + \beta_4\, h(IAddr_{i-1}, IAddr_i) \\
 & + \mathrm{FUCost}(Instr_{i-1}, Instr_i) \Big)
\end{aligned}
$$

w: number of ones; h: Hamming distance; FUCost: cost of switching functional units; α, β: determined through experiments
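The two building blocks of the model, w (number of ones) and h (Hamming distance), are easy to state in code. The sketch below shows one weighted term pair of the form "coefficient times w plus coefficient times h" summed over a sequence of 32-bit values. The function names, the `initial` parameter (the value present before the first instruction) and any coefficient values passed in are assumptions for illustration; the measured coefficients are not given on the slide.

```c
#include <stdint.h>
#include <stddef.h>

/* w: number of ones in a 32-bit value. */
unsigned ones(uint32_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;   /* clears the lowest set bit */
        n++;
    }
    return n;
}

/* h: Hamming distance, i.e. the number of bit positions that differ. */
unsigned hamming(uint32_t a, uint32_t b)
{
    return ones(a ^ b);
}

/* Sum over i of  alpha * w(v_i) + beta * h(v_{i-1}, v_i)
 * for a sequence of m values; `initial` plays the role of v_0. */
double weighted_bit_cost(const uint32_t *values, size_t m,
                         uint32_t initial, double alpha, double beta)
{
    double cost = 0.0;
    uint32_t prev = initial;
    for (size_t i = 0; i < m; i++) {
        cost += alpha * ones(values[i]) + beta * hamming(prev, values[i]);
        prev = values[i];
    }
    return cost;
}
```

The same helper could be applied to immediates, register numbers, register values or addresses, each with its own pair of coefficients.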


Other costs

$$E_{cpu\_data} = \sum_{i=1}^{m} \Big( \alpha_5\, w(DAddr_i) + \beta_5\, h(DAddr_{i-1}, DAddr_i) + \alpha_6\, w(Data_i) + \beta_6\, h(Data_{i-1}, Data_i) \Big)$$

$$E_{mem\_instr} = \sum_{i=1}^{m} \Big( \mathrm{MinCostMem}(InstrMem, Word\_width_i) + \alpha_7\, w(IAddr_i) + \beta_7\, h(IAddr_{i-1}, IAddr_i) + \alpha_8\, w(IData_i) + \beta_8\, h(IData_{i-1}, IData_i) \Big)$$

$$E_{mem\_data} = \sum_{i=1}^{m} \Big( \mathrm{MinCostMem}(DataMem, Direction, Word\_width_i) + \alpha_9\, w(DAddr_i) + \beta_9\, h(DAddr_{i-1}, DAddr_i) + \alpha_{10}\, w(Data_i) + \beta_{10}\, h(Data_{i-1}, Data_i) \Big)$$


Results

• It is not important which address bit is set to ‘1’.
• The number of ‘1’s on the address bus is irrelevant.
• The cost of flipping a bit on the address bus is independent of the bit position.
• It is not important which data bit is set to ‘1’.
• The number of ‘1’s on the data bus has a minor effect (3%).
• The cost of flipping a bit on the data bus is independent of the bit position.


Compiler optimizations for improving energy efficiency

• Energy-aware scheduling
• Energy-aware instruction selection
• Operator strength reduction: e.g. replace * by + and <<
• Minimize the bitwidth of loads and stores
• Standard compiler optimizations with energy as a cost function

E.g.: Register pipelining:

for i:= 1 to 10 do
  C:= 2 * a[i] + a[i-1];

becomes

R2:= a[0];
for i:= 1 to 10 do
  begin
    R1:= a[i];
    C:= 2 * R1 + R2;
    R2:= R1;
  end;

• Exploitation of the memory hierarchy
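Operator strength reduction from the list above can be illustrated with a multiplication by a constant. This is a minimal sketch with hypothetical function names; it uses unsigned arithmetic so that both forms wrap identically on overflow. Whether the shift-and-add form actually saves energy depends on the target processor, and many compilers apply this rewrite on their own.

```c
#include <stdint.h>

/* Multiplication by 10 using the multiplier. */
uint32_t times10_mul(uint32_t x)
{
    return x * 10u;
}

/* Same result using only shifts and an addition: 10x = 8x + 2x. */
uint32_t times10_shift_add(uint32_t x)
{
    return (x << 3) + (x << 1);
}
```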


3 key problems for future memory systems

1. (Average) Speed
2. Energy/Power
3. Predictability/Worst Case Execution Time

[Figure labels: Energy; Access times; smaller, faster, less energy]


1. (Average) Speed

Speed gap between processor and main DRAM increases.

[Figure: speed (axis ticks 2, 4, 8) over years; curves: CPU (1.5-2 p.a.), DRAM (1.07 p.a.); annotation: 2x every 2 years]

• early 1960s (Atlas): page fault ~ 2500 instructions
• 2002 (2 GHz µP): access to DRAM ~ 500 instructions

Penalty for a cache miss will soon be the same as for a page fault in Atlas.

[P. Machanick: Approaches to Addressing the Memory Wall, TR Nov. 2002, U. Brisbane]
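As a rough plausibility check of the "soon" claim, the growth factors quoted for the speed plot can be compounded. The sketch below is a back-of-the-envelope extrapolation, assuming constant yearly growth factors (the lower CPU figure of 1.5 and the DRAM figure of 1.07); the function names are hypothetical. Going from about 500 to about 2500 instructions per memory access corresponds to the gap growing by a factor of 5.

```c
/* Relative processor/DRAM speed gap after `years` years of constant
 * growth, normalized to 1.0 at year 0. */
double speed_gap(double cpu_growth, double dram_growth, unsigned years)
{
    double gap = 1.0;
    for (unsigned y = 0; y < years; y++)
        gap *= cpu_growth / dram_growth;
    return gap;
}

/* Number of whole years until the gap has grown by at least `target`,
 * or -1 if the gap does not grow at all. */
int years_until_gap(double cpu_growth, double dram_growth, double target)
{
    if (cpu_growth <= dram_growth)
        return -1;
    int years = 0;
    double gap = 1.0;
    while (gap < target) {
        gap *= cpu_growth / dram_growth;
        years++;
    }
    return years;
}
```

With these assumed factors, `years_until_gap(1.5, 1.07, 5.0)` returns 5, i.e. the gap grows fivefold in about five years.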


2. Power/Energy

[Figure: energy per access [nJ] (axis 0 to 2.5) as a function of memory size [bytes] (64 to 8192). Example (CACTI model).]

[Steinke et al., Inf 12, UniDo, 2002]


3. Predictability/WCET

• Predictability: For satisfying timing constraints in hard real-time systems, predictability is the most important concern; pre run-time scheduling is often the only practical means of providing predictability in a complex system [Xu, Parnas].
  Time-triggered, statically scheduled operating systems
• What about memory accesses?
  – Currently available caches don't solve the problem:
    • Improve the average case behavior
    • Use "non-deterministic" cache replacement algorithms

Scratch-pad/tightly coupled memory based predictability


Hierarchical memories using scratch pad memories (SPM)

[Figure: address space from 0 to FFF.. containing main memory and the scratch pad memory; hierarchy: processor, SPM (no tag memory), main memory]

Example: ARM7TDMI cores, well-known for low power consumption
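Because the scratch pad is simply a region of the address space, deciding whether an access goes to the SPM needs only an address range comparison and no tag lookup. The sketch below illustrates this; `SPM_BASE` and `SPM_SIZE` are hypothetical values chosen for the example, not the memory map of any particular ARM7TDMI system.

```c
#include <stdint.h>
#include <stdbool.h>

#define SPM_BASE 0x00300000u  /* hypothetical start address of the SPM */
#define SPM_SIZE 0x00001000u  /* hypothetical capacity: 4 KiB */

/* True if `addr` falls into the scratch pad region. The unsigned
 * subtraction wraps for addresses below SPM_BASE, so a single
 * comparison covers both bounds. */
bool in_spm(uint32_t addr)
{
    return (uint32_t)(addr - SPM_BASE) < SPM_SIZE;
}
```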


Exploitation of SPM

Which segment (array, loop, etc.) to be stored in SPM?

Gain $g_i$ and size $s_i$ for each segment $i$.

Maximise gain $G = \sum g_i$, respecting constraint $K \ge \sum s_i$ (both sums taken over the segments placed in the SPM).

Static memory allocation:

Solution: knapsack algorithm.

Dynamic reloading:

Finding optimal reloading points.

[Figure: processor with an SPM of capacity K and on-board main memory; example program segments (for loops, while and repeat loops, calls, arrays, ints) to be assigned to the SPM]
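The static allocation problem stated on this slide (choose segments with gains g_i and sizes s_i so that the total gain is maximal and the total size does not exceed K) is a 0/1 knapsack problem. The following is a minimal sketch using the textbook dynamic-programming solution; the slide only says "knapsack algorithm", so this particular method, the function name and the `SPM_MAX_CAPACITY` bound are assumptions for illustration.

```c
#include <stddef.h>
#include <string.h>

#define SPM_MAX_CAPACITY 4096  /* hypothetical upper bound on K, in bytes */

/* Returns the maximum total gain of any subset of the n segments whose
 * sizes sum to at most `capacity`, or -1 if `capacity` exceeds
 * SPM_MAX_CAPACITY. Uses a static table, so it is not reentrant. */
long spm_max_gain(const long *gain, const size_t *size,
                  size_t n, size_t capacity)
{
    static long best[SPM_MAX_CAPACITY + 1];  /* best[c]: max gain with c bytes */

    if (capacity > SPM_MAX_CAPACITY)
        return -1;
    memset(best, 0, sizeof best);

    for (size_t i = 0; i < n; i++) {
        if (gain[i] <= 0)
            continue;                        /* never worth placing */
        /* Walk capacities downwards so each segment is used at most once. */
        for (size_t c = capacity + 1; c-- > size[i]; ) {
            long candidate = best[c - size[i]] + gain[i];
            if (candidate > best[c])
                best[c] = candidate;
        }
    }
    return best[capacity];
}
```

For example, with gains {60, 100, 120}, sizes {10, 20, 30} and K = 50, the best choice is the second and third segment with a total gain of 220.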


Reduction in energy and average run-time

[Figure: results for Multi_sort (mix of sort algorithms); axis label: Cycles]


Energy consumption per functional unit, as a function of the SPM size

Parameters different from previous slide


Hardware-support for block-copying

• The DMA unit was modeled in VHDL, simulated, synthesized. Unit only makes up 4% of the processor chip.

• The unit can be put to sleep when it is unused.


• Code size reductions of up to 23% for a 256 byte SPM were determined using the DMA unit instead of the dynamic approach that uses processor instructions for copying.


[Figure: DMA, memory, CPU]