Fakultät für informatik informatik 12 technische universität dortmund Memory-architecture aware...
-
Upload
elfriede-dotzler -
Category
Documents
-
view
110 -
download
2
Transcript of Fakultät für informatik informatik 12 technische universität dortmund Memory-architecture aware...
fakultät für informatikinformatik 12
technische universität dortmund
Memory-architecture aware compilation
- Session 13 -
Peter MarwedelTU DortmundInformatik 12
Germany
- 2 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Schedule of the course
Time Monday Tuesday Wednesday Thursday Friday
09:30-11:00
1: Orientation, introduction
2: Models of computation + specs
5: Models of computation + specs
9: Mapping of applications to platforms
13: Memory aware compilation
17: Memory aware compilation
11:00 Brief break Brief break Brief break Brief break
11:15-12:30
6: Lab*: Ptolemy
10: Lab*: Scheduling
14: Lab*: Mem. opt.
18: Lab*: Mem. opt.
12:30 Lunch Lunch Lunch Lunch Lunch
14:00-15:20
3: Models of computation + specs
7: Mapping of applications to platforms
11: High-level optimizations*
15: Memory aware compilation
19: WCET & compilers*
15:20 Break Break Break Break Break
15:40-17:00
4: Lab*: Kahn process networks
8: Mapping of applications to platforms
12: High-level optimizations*
16: Memory aware compilation
20: Wrap-up
* Dr. Heiko Falk
- 3 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
1. Increasing speed gap
2. Major consumer of electrical energy
3. Timing predictability difficult to achieve
4. …
The Problem with Memories
Or: Why work on processors if memory is where the bottleneck is?
Memories?
Oops! Memories!
- 4 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Trends for the speeds
Speed gap between processor and main DRAM increases
[P. Machanik: Approaches to Addressing the Memory Wall, TR Nov. 2002, U. Brisbane]
2
4
8
2 4 5
Speed
years
CPU Per
form
ance
(1.5
-2 p
.a.)
DRAM (1.07 p.a.)
31
2x every 2 years
10
Similar problems also for embedded systems & MPSoCs
In the future:Memory access times >> processor cycle times
“Memory wall” problem
- 5 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Importance of Energy Efficiency
Cou
rtes
y: P
hilip
s
© Hug
o D
e M
an,
IME
C,
200
7
Efficient software design needed, otherwise, the price for software flexibility cannot be paid.Efficient software design needed, otherwise, the price for software flexibility cannot be paid.
poor design techniques
IPE=Inherent power efficiencyAmI=Ambient Intelligence
GO
Ps/
J
- 6 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Energy consumption in mobile devices
[O. Vargas (Infineon Technologies): Minimum power consumption in mobile-phone memory subsystems; Pennwell Portable Design - September 2005;] Thanks to Thorsten Koch (Nokia/ Univ. Dortmund) for providing this source.
- 7 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Memory system frequently consumes >50 % of the energy used for processing
Multiprocessor with cache ($)
Cache ($)-less monoprocessor
Average over 200 benchmarks analyzed by Verma (U. Dortmund)
[M. Verma, P. Marwedel: Advanced Memory Optimization Techniques for Low-Power Embedded Processors, Springer, 2007]
29%
71%
Processor Energy
Main Mem. Energy
14,8%
5,2%
28,1%51,9%
Proc. Energy
I-Cache Energy
D-Cache Energy
Main Mem.Energy
- 8 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Similar information according to other sources
Strong ARM
Icache26%
Dcache16%
Clock10%
IMMU9%
EBOX8%
DMMU8%
Others5%
Ibox18%
IEEE Journal of SSC Nov. 96
[Based on slide by and ©: Osman S. Unsal, Israel Koren, C. Mani Krishna, Csaba Andras Moritz, U. of Massachusetts, Amherst, 2001] [Segars 01 according to Vahid@ISSS01]
- 9 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
“Everything“ is large for large memories
* Monolithic register file; Rixner’s et al. model [HPCA’00], Technology of 0.18 mm; VLIW configurations for a certain number of ports („GPxMyREGz where: x={6}, y={2, 3} and z={16, 32, 64, 128“}; Based on slide by and ©: Harry Valero, U. Barcelona, 2001
Cycle Time (ns)* Area (2x106) Power (W)
1
1,1
1,2
1,3
1,4
1,5
1,6
1,7
1,8
16 32 64 128
Register File Size
0
1
2
3
4
5
6
7
16 32 64 128
GP6M2 GP6M3
0
2
4
6
8
10
12
14
16 32 64 128
Register File Size
- 10 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Dependency on the size
Energy
Access timesApplications are getting larger and larger …
Sub-banking
+ locality of references Memory hierarchies
- 11 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Timing Predictability
G.721: using unified Cache@ARM7TDMI
Worst case execution time (WCET) larger than without cache
See later slide for experimental setup
Many embedded systems are real-time systems
computations to be finished in a given amount of time
Most memory hierarchies (e.g. caches) for PC-like systems designed for good average case, not for good worst case behavior.
- 12 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Multiple Objectives for Memory System Design
(Average) Performance
• Throughput
• Latency
Energy consumption
Predictability, good worst caseexecution time bound (WCET)
Size
Cost
….
- 13 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Pareto curves
better
worse
- 14 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Pareto points
Definition: A (design) point Ji is dominated by pointJk, if Jk is equal or better than Ji in each criterion (Ji Jk).
Definition: A (design) point is Pareto-optimal or aPareto point, if it is not dominated by any other point.
- 15 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Energy models
Commercial tools frequently very imprecise
Model of Tiwari (Dissertation, Princeton 1996):Cost of instructions and of transitions between instructions;Does not separate out the cost of memory access
Model of Simunic, de Micheli (DAC 99):Model based on data sheets; does not require measurements.Does not take transitions into account.
Russell, Jacome (ICCD, 1998): based on precise measurement for two fixed configurations;cannot predict effect of changes to memory architecture.
Lee (LCTES 2001): detailed analysis of the effect pipeline stages; does not include multi-cycle operations and stalls
Dedicated energy models.
- 16 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Source of energy models1) measurements
E.g.: ATMEL board with ARM7TDMI and ext. SRAM
DataMemory
InstructionMemory
IAddr
VDD
ALU
Multi-plier
BarrelShifter
Register File
Instr. Decoder& Control Logic
Inst
rImm
RegValue
Reg#
Opcode
ARM7DAddr
mA mA
Data
Instr
h-costs, address bus, CPU + memory current
110
115
120
125
130
135
140
145
150
0 5 10 15 20
Hamming-Distance
Cu
rren
t [m
A]
Read
Write
- 17 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Instruction dependent costs in the CPU
Ecpu_instr =MinCostCPU(Opcodei) +1 * w(Immi,j) + ß1 * h(Immi-1,j, Immi,j) +2 * w(Regi,k) + ß2 * h(Regi-1,k, Regi,k) +3 * w(RegVali,k) + ß3 * h(RegVali-1,k, RegVali,k) +4 * w(IAddri) + ß4 * h(IAddri-1, IAddri) +FUCost(Instri-1,Instri)
Cost for a sequence of m instructions
w: number of ones;h: Hamming distance;FUCost: cost of switching functional units, ß: determined through experiments
- 18 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Other costs
Ecpu_data = 5 * w(DAddri) + ß5 * h(DAddri-1, DAddri) +6 * w(Datai) + ß6 * h(Datai-1, Datai)
Ecpu_data = 5 * w(DAddri) + ß5 * h(DAddri-1, DAddri) +6 * w(Datai) + ß6 * h(Datai-1, Datai)
Emem_instr =MinCostMem(InstrMem,Word_widthi) +7 * w(IAddri) + ß7 * h(IAddri-1, IAddri) +8 * w(IDatai) + ß8 * h(IDatai-1, IDatai)
Emem_instr =MinCostMem(InstrMem,Word_widthi) +7 * w(IAddri) + ß7 * h(IAddri-1, IAddri) +8 * w(IDatai) + ß8 * h(IDatai-1, IDatai)
Emem_data = MinCostMem (DataMem, Direction, Word_widthi)
+ 9 * w(DAddri) + ß9 * h(DAddri-1, DAddri) +10 * w(Datai) + ß10 * h(Datai-1, Datai)
Emem_data = MinCostMem (DataMem, Direction, Word_widthi)
+ 9 * w(DAddri) + ß9 * h(DAddri-1, DAddri) +10 * w(Datai) + ß10 * h(Datai-1, Datai)
- 19 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Results
It is not important, which address bit is set to ‚1‘
The number of ‚1‘s in the address bus is irrelevant
The cost of flipping a bit on the address bus is independent of the bit position.
It is not important, which data bit is set to ‚1‘
The number of ‚1‘s on the data bus has a minor effect (3%)
The cost of flipping a bit on the data bus is independent of the bit position.
- 20 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Hamming Distance between adjacent addresses is playing a major role
h-costs, address bus, CPU + memory current
110
115
120
125
130
135
140
145
150
0 5 10 15 20
Hamming-Distance
Cu
rren
t [m
A]
Read
Write
- 21 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Source of energy models2) computer models, e.g. CACTI
Comparison with SPICE
Cache model used
http://research.compaq.com/wrl/people/jouppi/CACTI.html
- 22 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Register allocation
Registers = fastest level in the memory hierarchy Interest in good global register allocation techniques
Frequently based on coloring of interference graph
Registers neither suitable for |objects| ≫ single words nor for code
Register allocation not considered in the remainder
lifetimes
v1
v2v3
v1
v2
v3
- 23 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Scratch pad memories (SPM):Fast, energy-efficient, timing-predictable
Address space
ARM7TDMI cores, well-known for low power consumption
scratch pad memory
0
FFF..
ExampleExample
Small; no tag memory
SPMs are small, physically separate memories mapped into the address space;
Selection is by an appropriate address decoder (simple!)
SPM
select
- 24 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Predictability and scratch-pad memories
… In essence, we must reinvent computer science. Fortunately, we have quite a bit of knowledge and experience to draw upon. Architecture techniques such as software-managed caches promise to deliver much of the benefit of memory hierarchy without the timing unpredictability.
[Ed Lee: Absolutely Positively on Time: What would it take?, IEEE Computer, 2005]
… pre-run-time scheduling is often the only practicalmeans of providing predictability in a complex system.
[J. Xu, D. Parnas: On satisfying timing constraints in hard real-time systems, IEEE Trans. Soft. Engineering, 1993, p. 70–84]
- 25 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Comparison of currents using measurements
E.g.: ATMEL board with ARM7TDMI andext. SRAM
Current32 Bit-Load Instruction (Thumb)
48,2 50,9 44,4 53,1
11677,2 82,2
1,16
0
50
100
150
200
Prog Main/ DataMain
Prog Main/ DataSPM
Prog SPM/ DataMain
Prog SPM/ Data SPM
mA
Core+SPM (mA) Main Memory Current (mA)
- 26 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Scratchpad vs. main memory energy
Energy32 Bit-Load Instruction (Thumb)
115,8
51,6
76,5
16,4
0,020,040,060,080,0
100,0120,0140,0
Prog Main/DataMain
Prog Main/ DataSPM
Prog SPM/ DataMain
Prog SPM/Data SPM
nJ
Energy
Example: Atmel ARM-Evaluation board Savings (86%) even larger.
energy reduction:/ 7.06
100% predictable
energy reduction:/ 7.06
100% predictable
€
- 27 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Why not just use a cache ?
0
1
2
3
4
5
6
7
8
9
256 512 1024 2048 4096 8192 16384
memory size
En
erg
y p
er
ac
ce
ss
[n
J]
.
Scratch pad
Cache, 2way, 4GB space
Cache, 2way, 16 MB space
Cache, 2way, 1 MB space
Energy consumption in tags, comparators and muxes is significant.
[R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, P. Marwedel. Scratchpad Memory : A Design Alternative for Cache On-chip memory in Embedded Systems, Intern. Workshop on Hardware/ Software Codesign (CODES), 2002]
- 28 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Set-associative cachen-way cache
Address
Data
Tag Index
Tags data blockTags
= =
n = 2
1
data block
way 0 way 1$ (€)
- 29 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Influence of the associativity
Technology different from previous slide.
- 30 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Availability of SPMs=(“Tightly Coupled Memories”)
Source: http://www.arm.com/products/CPUs/core_selector.html
ARM CPU Core Caches TCM AvailableARM 1026EJ-S Variable yes
ARM 1136J(F)-S Variable yes
ARM 1176JZ(F)-S Variable yes
ARM 926EJ-S Variable yes
ARM 1026EJ-S Variable yes
ARM 1156T2(F)-S Variable yes
ARM 946E-S Variable yes
ARM 966E-S - yes
ARM 968E-S - yes
All others no
- 31 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Current usage for ARM
1. Use pragma in C-source to allocate to specific section:For example:#pragma arm section rwdata = "foo", rodata = "bar" int x2 = 5; // in foo (data part of region)int const z2[3] = {1,2,3}; // in bar
2. Input scatter loading file to linker for allocating section to specific address range
http://www.arm.com/documentation/ Software_Development_Tools/index.html
(2 d
iffe
rent
exa
mp
les)
LOAD_ROM_1 0x000{
EXEC_ROM_1 0x000{
}program1.o (+RO)
program1.o (+RW,+ZI)
DRAM 0x18000 0x8000{
}
Scatter description
Load region descriptionExecution region description
Input section description
Input section description
Execution region description
- 32 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Popular among designers?
Information received from designers:
Used for buffering speech in mobile phones(mobile phone company)
Manual mapping of frequently accessed data
Essentially no idea on how to exploit it (WLAN/Bluetooth specialists & major vendor)
Why not change this situation?
- 33 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Similar Conceptfor the Cell processor *
Motivation same as for this tutorial: Large memory latency Huge overhead for
automatically managed caches
Similar for Infineon TriCore
Local SPE processors fetch instructions and data from local storage LS (256 kB).LS not designed as a cache. Separate DMA transfers required to fill and spill.
* Sony, IBM, Toshiba
Main Memory
- 34 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Infineon TriCore
© Infineon, 2005
- 35 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Migration of instructions to small memoryYasuura’s architecture (Kyushu U.)
e
2 instruction memories: main memory + “decompressor” memory;Compiler optimizing code allocation & size of the 2 instruction memories
2 instruction memories: main memory + “decompressor” memory;Compiler optimizing code allocation & size of the 2 instruction memories
Merged instruction
Small andLow Power
Large andHigh Power
[T. Ishihara, H. Yasuura: A Power Reduction Technique with Object Code Merging for Application Specific Embedded Processors, Design and Automation in Europe Conference (DATE), 2000]
- 36 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
[Based on slide by & © : H. Yasuura, 2000]
Energy savings
- 37 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Migration of data and instructions, global optimization model (TU Dortmund)
Which object (array, loop, etc.) to be stored in SPM?
Non-overlaying memory allocation:
Gain gk & size sk for each object k.
Maximise gain G = gk, respecting size of SPM SSP sk.
Solution: Knapsack algorithm.
Overlaying allocation:
Moving objects back and forth between hierarchy levelsProcessor
Scratch pad memory,capacity SSP
main memory
?
For i .{ }
for j ..{ }
while ...
Repeat
function ...
Array ...
Int ...
Array
Example:
- 38 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
ILP representation- migrating functions and variables-
Symbols:
S(vark ) = size of variable k
nk = number of accesses to variable k
e(vark ) = energy saved per variable access, if vark is migrated
E(vark ) = energy saved if variable vark is migrated (= e(vark ) n(vark ))
x(vark ) = decision variable, =1 if variable k is migrated to SPM, =0 otherwise
K = set of variables
Similar for functions I
Integer linear programming formulation:
Maximize kK x(vark ) E(vark ) + iI x(Fi ) E(Fi )
Subject to the constraint
i I S (Fi ) x(Fi ) + k K S (vark ) x(vark ) SSP
- 39 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Reduction in energy and average run-time
Multi_sort (mix of sort algorithms)
Cyc
les
[x1
00
] E
ner
gy
[µJ]
Feasible with standard compiler & pre- or postpass optimization
Measured processor / external memory energy + CACTI values for SPM (combined model)
Numbers will change with technology, algorithms remain unchanged.
- 40 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Allocation of basic blocks
Fine-grained granularity smoothens dependency on the size of the scratch pad.
Requires additional jump instructions to return to "main" memory.
Fine-grained granularity smoothens dependency on the size of the scratch pad.
Requires additional jump instructions to return to "main" memory.
Main memory
BB1
BB2
Jump1
Jump2
Jump4
Jump3
For consecutive basic blocks
Statically 2 jumps,but only one is taken
- 41 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Taking consecutive basic blocksinto account
Approach:
• Consider sets of consecutive BBs as a new kind of basic blocks (“multi blocks”)
• Add a constraint preventing the same block from being selected twice:x(BBb ) + x(Fi ) + j multiblocks(b), j x x(BBj ) 1
b {blocks} {multi blocks}
Block b is either moved individually, as part of a function, as part of one of its enclosing multi-blocks or not at all.
BB1
BB3
BB2
BB12
BB23
BB123
- 42 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Allocation of basic blocks, sets of adjacent basic blocks and the entire stack
Requires generation of additional jumps (special compiler) or procedure exlining (“procedural abstraction”)
Cyc
les
[x1
00
] E
ner
gy
[µJ]
- 43 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Savings for memory system energy alone
Combined model for memories
- 44 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Underlying reasons for large energy savings
Potential energy saving is determined by the difference between the energy consumption for large and small memories.
In this case:
factor of 20 savings of up to 95%
Can even be larger for larger memories
- 45 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Summary
Multiple objectives for the design of the memory system
• Energy efficiency
• (predictable) performance, cost, …
• small is beautiful (efficient) efficiency(memory hierarchies)>efficiency(single mem)
Energy models: based on measurements+toolsScratch pads: energy-efficient, fast, timing predictableScratch pad allocation strategies:
• Static (non-overlaying)- Globals,
- functions, basic blocks
- 46 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2008
TU Dortmund
Brief break (if on schedule)
Q&A?