Fakultät für informatik informatik 12 technische universität dortmund Memory-architecture aware...

fakultät für informatikinformatik 12

technische universität dortmund

Memory-architecture aware compilation

- Session 13 -

Peter MarwedelTU DortmundInformatik 12

Germany

- 2 -technische universitätdortmund

fakultät für informatik

p. marwedel, informatik 12, 2008

TU Dortmund

Schedule of the course

Time Monday Tuesday Wednesday Thursday Friday

09:30-11:00

1: Orientation, introduction

2: Models of computation + specs


9: Mapping of applications to platforms

13: Memory aware compilation


11:00 Brief break Brief break Brief break Brief break

11:15-12:30

6: Lab*: Ptolemy

10: Lab*: Scheduling

14: Lab*: Mem. opt.

18: Lab*: Mem. opt.

12:30 Lunch Lunch Lunch Lunch Lunch

14:00-15:20



11: High-level optimizations*


19: WCET & compilers*

15:20 Break Break Break Break Break

15:40-17:00

4: Lab*: Kahn process networks


12: High-level optimizations*


20: Wrap-up

* Dr. Heiko Falk




TU Dortmund

1. Increasing speed gap

2. Major consumer of electrical energy

3. Timing predictability difficult to achieve

4. …

The Problem with Memories

Or: Why work on processors if memory is where the bottleneck is?

Memories?

Oops! Memories!




TU Dortmund

Trends for the speeds

Speed gap between processor and main DRAM increases

[P. Machanik: Approaches to Addressing the Memory Wall, TR Nov. 2002, U. Brisbane]

2

4

8

2 4 5

Speed

years

CPU Per

form

ance

(1.5

-2 p

.a.)

DRAM (1.07 p.a.)

31

2x every 2 years

10

Similar problems also for embedded systems & MPSoCs

In the future:Memory access times >> processor cycle times

“Memory wall” problem




TU Dortmund

Importance of Energy Efficiency

Cou

rtes

y: P

hilip

s

© Hug

o D

e M

an,

IME

C,

200

7

Efficient software design needed, otherwise, the price for software flexibility cannot be paid.Efficient software design needed, otherwise, the price for software flexibility cannot be paid.

poor design techniques

IPE=Inherent power efficiencyAmI=Ambient Intelligence

GO

Ps/

J




TU Dortmund

Energy consumption in mobile devices

[O. Vargas (Infineon Technologies): Minimum power consumption in mobile-phone memory subsystems; Pennwell Portable Design - September 2005;] Thanks to Thorsten Koch (Nokia/ Univ. Dortmund) for providing this source.




TU Dortmund

Memory system frequently consumes >50 % of the energy used for processing

Multiprocessor with cache ($)

Cache ($)-less monoprocessor

Average over 200 benchmarks analyzed by Verma (U. Dortmund)

[M. Verma, P. Marwedel: Advanced Memory Optimization Techniques for Low-Power Embedded Processors, Springer, 2007]

29%

71%

Processor Energy

Main Mem. Energy

14,8%

5,2%

28,1%51,9%

Proc. Energy

I-Cache Energy

D-Cache Energy

Main Mem.Energy




TU Dortmund

Similar information according to other sources

Strong ARM

Icache26%

Dcache16%

Clock10%

IMMU9%

EBOX8%

DMMU8%

Others5%

Ibox18%

IEEE Journal of SSC Nov. 96

[Based on slide by and ©: Osman S. Unsal, Israel Koren, C. Mani Krishna, Csaba Andras Moritz, U. of Massachusetts, Amherst, 2001] [Segars 01 according to Vahid@ISSS01]




TU Dortmund

“Everything“ is large for large memories

* Monolithic register file; Rixner’s et al. model [HPCA’00], Technology of 0.18 mm; VLIW configurations for a certain number of ports („GPxMyREGz where: x={6}, y={2, 3} and z={16, 32, 64, 128“}; Based on slide by and ©: Harry Valero, U. Barcelona, 2001

Cycle Time (ns)* Area (2x106) Power (W)

1

1,1

1,2

1,3

1,4

1,5

1,6

1,7

1,8

16 32 64 128

Register File Size

0

1

2

3

4

5

6

7

16 32 64 128

GP6M2 GP6M3

0

2

4

6

8

10

12

14

16 32 64 128

Register File Size




TU Dortmund

Dependency on the size

Energy

Access timesApplications are getting larger and larger …

Sub-banking

+ locality of references Memory hierarchies




TU Dortmund

Timing Predictability

G.721: using unified Cache@ARM7TDMI

Worst case execution time (WCET) larger than without cache

See later slide for experimental setup

Many embedded systems are real-time systems

computations to be finished in a given amount of time

Most memory hierarchies (e.g. caches) for PC-like systems designed for good average case, not for good worst case behavior.




TU Dortmund

Multiple Objectives for Memory System Design

(Average) Performance

• Throughput

• Latency

Energy consumption

Predictability, good worst caseexecution time bound (WCET)

Size

Cost

….




TU Dortmund

Pareto curves

better

worse




TU Dortmund

Pareto points

Definition: A (design) point Ji is dominated by pointJk, if Jk is equal or better than Ji in each criterion (Ji Jk).

Definition: A (design) point is Pareto-optimal or aPareto point, if it is not dominated by any other point.




TU Dortmund

Energy models

Commercial tools frequently very imprecise

Model of Tiwari (Dissertation, Princeton 1996):Cost of instructions and of transitions between instructions;Does not separate out the cost of memory access

Model of Simunic, de Micheli (DAC 99):Model based on data sheets; does not require measurements.Does not take transitions into account.

Russell, Jacome (ICCD, 1998): based on precise measurement for two fixed configurations;cannot predict effect of changes to memory architecture.

Lee (LCTES 2001): detailed analysis of the effect pipeline stages; does not include multi-cycle operations and stalls

Dedicated energy models.




TU Dortmund

Source of energy models1) measurements

E.g.: ATMEL board with ARM7TDMI and ext. SRAM

DataMemory

InstructionMemory

IAddr

VDD

ALU

Multi-plier

BarrelShifter

Register File

Instr. Decoder& Control Logic

Inst

rImm

RegValue

Reg#

Opcode

ARM7DAddr

mA mA

Data

Instr

h-costs, address bus, CPU + memory current

110

115

120

125

130

135

140

145

150

0 5 10 15 20

Hamming-Distance

Cu

rren

t [m

A]

Read

Write




TU Dortmund

Instruction dependent costs in the CPU

Ecpu_instr =MinCostCPU(Opcodei) +1 * w(Immi,j) + ß1 * h(Immi-1,j, Immi,j) +2 * w(Regi,k) + ß2 * h(Regi-1,k, Regi,k) +3 * w(RegVali,k) + ß3 * h(RegVali-1,k, RegVali,k) +4 * w(IAddri) + ß4 * h(IAddri-1, IAddri) +FUCost(Instri-1,Instri)

Cost for a sequence of m instructions

w: number of ones;h: Hamming distance;FUCost: cost of switching functional units, ß: determined through experiments




TU Dortmund

Other costs

Ecpu_data = 5 * w(DAddri) + ß5 * h(DAddri-1, DAddri) +6 * w(Datai) + ß6 * h(Datai-1, Datai)

Ecpu_data = 5 * w(DAddri) + ß5 * h(DAddri-1, DAddri) +6 * w(Datai) + ß6 * h(Datai-1, Datai)

Emem_instr =MinCostMem(InstrMem,Word_widthi) +7 * w(IAddri) + ß7 * h(IAddri-1, IAddri) +8 * w(IDatai) + ß8 * h(IDatai-1, IDatai)

Emem_instr =MinCostMem(InstrMem,Word_widthi) +7 * w(IAddri) + ß7 * h(IAddri-1, IAddri) +8 * w(IDatai) + ß8 * h(IDatai-1, IDatai)

Emem_data = MinCostMem (DataMem, Direction, Word_widthi)

+ 9 * w(DAddri) + ß9 * h(DAddri-1, DAddri) +10 * w(Datai) + ß10 * h(Datai-1, Datai)

Emem_data = MinCostMem (DataMem, Direction, Word_widthi)

+ 9 * w(DAddri) + ß9 * h(DAddri-1, DAddri) +10 * w(Datai) + ß10 * h(Datai-1, Datai)




TU Dortmund

Results

It is not important, which address bit is set to ‚1‘

The number of ‚1‘s in the address bus is irrelevant

The cost of flipping a bit on the address bus is independent of the bit position.

It is not important, which data bit is set to ‚1‘

The number of ‚1‘s on the data bus has a minor effect (3%)

The cost of flipping a bit on the data bus is independent of the bit position.




TU Dortmund

Hamming Distance between adjacent addresses is playing a major role

h-costs, address bus, CPU + memory current

110

115

120

125

130

135

140

145

150

0 5 10 15 20

Hamming-Distance

Cu

rren

t [m

A]

Read

Write




TU Dortmund

Source of energy models2) computer models, e.g. CACTI

Comparison with SPICE

Cache model used

http://research.compaq.com/wrl/people/jouppi/CACTI.html




TU Dortmund

Register allocation

Registers = fastest level in the memory hierarchy Interest in good global register allocation techniques

Frequently based on coloring of interference graph

Registers neither suitable for |objects| ≫ single words nor for code

Register allocation not considered in the remainder

lifetimes

v1

v2v3

v1

v2

v3




TU Dortmund

Scratch pad memories (SPM):Fast, energy-efficient, timing-predictable

Address space

ARM7TDMI cores, well-known for low power consumption

scratch pad memory

0

FFF..

ExampleExample

Small; no tag memory

SPMs are small, physically separate memories mapped into the address space;

Selection is by an appropriate address decoder (simple!)

SPM

select




TU Dortmund

Predictability and scratch-pad memories

… In essence, we must reinvent computer science. Fortunately, we have quite a bit of knowledge and experience to draw upon. Architecture techniques such as software-managed caches promise to deliver much of the benefit of memory hierarchy without the timing unpredictability.

[Ed Lee: Absolutely Positively on Time: What would it take?, IEEE Computer, 2005]

… pre-run-time scheduling is often the only practicalmeans of providing predictability in a complex system.

[J. Xu, D. Parnas: On satisfying timing constraints in hard real-time systems, IEEE Trans. Soft. Engineering, 1993, p. 70–84]




TU Dortmund

Comparison of currents using measurements

E.g.: ATMEL board with ARM7TDMI andext. SRAM

Current32 Bit-Load Instruction (Thumb)

48,2 50,9 44,4 53,1

11677,2 82,2

1,16

0

50

100

150

200

Prog Main/ DataMain

Prog Main/ DataSPM

Prog SPM/ DataMain

Prog SPM/ Data SPM

mA

Core+SPM (mA) Main Memory Current (mA)




TU Dortmund

Scratchpad vs. main memory energy

Energy32 Bit-Load Instruction (Thumb)

115,8

51,6

76,5

16,4

0,020,040,060,080,0

100,0120,0140,0

Prog Main/DataMain

Prog Main/ DataSPM

Prog SPM/ DataMain

Prog SPM/Data SPM

nJ

Energy

Example: Atmel ARM-Evaluation board Savings (86%) even larger.

energy reduction:/ 7.06

100% predictable

energy reduction:/ 7.06

100% predictable

€




TU Dortmund

Why not just use a cache ?

0

1

2

3

4

5

6

7

8

9

256 512 1024 2048 4096 8192 16384

memory size

En

erg

y p

er

ac

ce

ss

[n

J]

.

Scratch pad

Cache, 2way, 4GB space

Cache, 2way, 16 MB space

Cache, 2way, 1 MB space

Energy consumption in tags, comparators and muxes is significant.

[R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, P. Marwedel. Scratchpad Memory : A Design Alternative for Cache On-chip memory in Embedded Systems, Intern. Workshop on Hardware/ Software Codesign (CODES), 2002]




TU Dortmund

Set-associative cachen-way cache

Address

Data

Tag Index

Tags data blockTags

= =

n = 2

1

data block

way 0 way 1$ (€)




TU Dortmund

Influence of the associativity

Technology different from previous slide.




TU Dortmund

Availability of SPMs=(“Tightly Coupled Memories”)

Source: http://www.arm.com/products/CPUs/core_selector.html

ARM CPU Core Caches TCM AvailableARM 1026EJ-S Variable yes

ARM 1136J(F)-S Variable yes

ARM 1176JZ(F)-S Variable yes

ARM 926EJ-S Variable yes

ARM 1026EJ-S Variable yes

ARM 1156T2(F)-S Variable yes

ARM 946E-S Variable yes

ARM 966E-S - yes

ARM 968E-S - yes

All others no




TU Dortmund

Current usage for ARM

1. Use pragma in C-source to allocate to specific section:For example:#pragma arm section rwdata = "foo", rodata = "bar" int x2 = 5; // in foo (data part of region)int const z2[3] = {1,2,3}; // in bar

2. Input scatter loading file to linker for allocating section to specific address range

http://www.arm.com/documentation/ Software_Development_Tools/index.html

(2 d

iffe

rent

exa

mp

les)

LOAD_ROM_1 0x000{

EXEC_ROM_1 0x000{

}program1.o (+RO)

program1.o (+RW,+ZI)

DRAM 0x18000 0x8000{

}

Scatter description

Load region descriptionExecution region description

Input section description

Input section description

Execution region description




TU Dortmund

Popular among designers?

Information received from designers:

Used for buffering speech in mobile phones(mobile phone company)

Manual mapping of frequently accessed data

Essentially no idea on how to exploit it (WLAN/Bluetooth specialists & major vendor)

Why not change this situation?




TU Dortmund

Similar Conceptfor the Cell processor *

Motivation same as for this tutorial: Large memory latency Huge overhead for

automatically managed caches

Similar for Infineon TriCore

Local SPE processors fetch instructions and data from local storage LS (256 kB).LS not designed as a cache. Separate DMA transfers required to fill and spill.

* Sony, IBM, Toshiba

Main Memory




TU Dortmund

Migration of instructions to small memoryYasuura’s architecture (Kyushu U.)

e

2 instruction memories: main memory + “decompressor” memory;Compiler optimizing code allocation & size of the 2 instruction memories

2 instruction memories: main memory + “decompressor” memory;Compiler optimizing code allocation & size of the 2 instruction memories

Merged instruction

Small andLow Power

Large andHigh Power

[T. Ishihara, H. Yasuura: A Power Reduction Technique with Object Code Merging for Application Specific Embedded Processors, Design and Automation in Europe Conference (DATE), 2000]




TU Dortmund

Migration of data and instructions, global optimization model (TU Dortmund)

Which object (array, loop, etc.) to be stored in SPM?

Non-overlaying memory allocation:

Gain gk & size sk for each object k.

Maximise gain G = gk, respecting size of SPM SSP sk.

Solution: Knapsack algorithm.

Overlaying allocation:

Moving objects back and forth between hierarchy levelsProcessor

Scratch pad memory,capacity SSP

main memory

?

For i .{ }

for j ..{ }

while ...

Repeat

function ...

Array ...

Int ...

Array

Example:




TU Dortmund

ILP representation- migrating functions and variables-

Symbols:

S(vark ) = size of variable k

nk = number of accesses to variable k

e(vark ) = energy saved per variable access, if vark is migrated

E(vark ) = energy saved if variable vark is migrated (= e(vark ) n(vark ))

x(vark ) = decision variable, =1 if variable k is migrated to SPM, =0 otherwise

K = set of variables

Similar for functions I

Integer linear programming formulation:

Maximize kK x(vark ) E(vark ) + iI x(Fi ) E(Fi )

Subject to the constraint

i I S (Fi ) x(Fi ) + k K S (vark ) x(vark ) SSP




TU Dortmund

Reduction in energy and average run-time

Multi_sort (mix of sort algorithms)

Cyc

les

[x1

00

] E

ner

gy

[µJ]

Feasible with standard compiler & pre- or postpass optimization

Measured processor / external memory energy + CACTI values for SPM (combined model)

Numbers will change with technology, algorithms remain unchanged.




TU Dortmund

Allocation of basic blocks

Fine-grained granularity smoothens dependency on the size of the scratch pad.

Requires additional jump instructions to return to "main" memory.

Fine-grained granularity smoothens dependency on the size of the scratch pad.

Requires additional jump instructions to return to "main" memory.

Main memory

BB1

BB2

Jump1

Jump2

Jump4

Jump3

For consecutive basic blocks

Statically 2 jumps,but only one is taken




TU Dortmund

Taking consecutive basic blocksinto account

Approach:

• Consider sets of consecutive BBs as a new kind of basic blocks (“multi blocks”)

• Add a constraint preventing the same block from being selected twice:x(BBb ) + x(Fi ) + j multiblocks(b), j x x(BBj ) 1

b {blocks} {multi blocks}

Block b is either moved individually, as part of a function, as part of one of its enclosing multi-blocks or not at all.

BB1

BB3

BB2

BB12

BB23

BB123




TU Dortmund

Allocation of basic blocks, sets of adjacent basic blocks and the entire stack

Requires generation of additional jumps (special compiler) or procedure exlining (“procedural abstraction”)

Cyc

les

[x1

00

] E

ner

gy

[µJ]




TU Dortmund

Savings for memory system energy alone

Combined model for memories




TU Dortmund

Underlying reasons for large energy savings

Potential energy saving is determined by the difference between the energy consumption for large and small memories.

In this case:

factor of 20 savings of up to 95%

Can even be larger for larger memories




TU Dortmund

Summary

Multiple objectives for the design of the memory system

• Energy efficiency

• (predictable) performance, cost, …

• small is beautiful (efficient) efficiency(memory hierarchies)>efficiency(single mem)

Energy models: based on measurements+toolsScratch pads: energy-efficient, fast, timing predictableScratch pad allocation strategies:

• Static (non-overlaying)- Globals,

- functions, basic blocks




TU Dortmund

Brief break (if on schedule)

Q&A?

Fakultät für informatik informatik 12 technische universität dortmund Memory-architecture aware...

Documents

Transcript of Fakultät für informatik informatik 12 technische universität dortmund Memory-architecture aware...