Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 -...
Transcript of Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 -...
![Page 1: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/1.jpg)
fakultät für informatikinformatik 12
technische universität dortmund
Optimizations
Peter MarwedelTU DortmundInformatik 12
Germany
2009/01/17
Gra
phic
s: ©
Ale
xand
ra N
olte
, Ges
ine
Mar
wede
l, 20
03
![Page 2: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/2.jpg)
- 2 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
Structure of this course
2: Specifications
3: Embedded System HW
4: Standard Software, Real-Time Operating Systems
5: Scheduling,HW/SW-Partitioning, Applications to MP-Mapping
6: Evaluationand Validation
8: Testing
7: Optimization of Embedded Systems
Appl
icat
ion
Kno
wle
dge
![Page 3: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/3.jpg)
- 3 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
Hardware-support for block-copying
The DMA unit was modeled in VHDL, simulated, synthesized. Unit only makes up 4% of the processor chip.The unit can be put to sleep when it is unused.
Code size reductions of up to 23% for a 256 byte SPM were determined using the DMA unit instead of the overlaying allocation that uses processor instructions for copying.
DMA Scratch-pad ProcessorMemory
[Lars Wehmeyer, Peter Marwedel: Fast, Efficient and Predictable Memory Accesses, Springer, 2006]
![Page 4: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/4.jpg)
- 4 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
References to large arrays (1)- Regular accesses -
for (i=0; i<n; i++) for (j=0; j<n; j++) for (k=0; k<n; k++) U[i][j]=U[i][j] + V[i][k] * W[k][j]
Tiling
[M. Kandemir, J. Ramanujam, M. J. Irwin, N. Vijaykrishnan, I. Kadayif, A. Parikh: Dynamic Management of Scratch-Pad Memory Space, DAC, 2001, pp. 690-695]
for (it=0; it<n; it=it+Sb) {read_tile V[it:it+Sb-1, 1:n] for (jt=0; jt<n; jt=jt+Sb) {read_tile U[it:it+Sb-1, jt:jt+Sb-1] read_tile W[1:n,jt:jt+Sb-1] U[it:it+Sb-1,jt:jt+Sb-1]=U[it:it+Sb-1,jt:jt+Sb-1] + V[it:it+Sb-1,1:n] * W [1:n, jt:jt+Sb-1] write_tile U[it:it+Sb-1,jt:jt+Sb-1] }}
![Page 5: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/5.jpg)
- 5 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
References to large arrays- Irregular accesses -
for each loop nest L in program P { apply loop tiling to L based on the access patterns of regular array references; for each assignment to index array X update the block minimum and maximum values of X; compute the set of array elements that are irregularly referenced in the current inter-tile iteration; compare the memory access costs for using and not using SPM; if (using SPM is beneficial) execute the intra-tile loop iterations by using the SPM else execute the intra-tile loop iterations by not using the SPM}
[G. Chen, O. Ozturk, M. Kandemir, M. Karakoy: Dynamic Scratch-Pad Memory Management for Irregular Array Access Patterns, DATE, 2006]
![Page 6: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/6.jpg)
- 6 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
Results for irregular approach
Cache
Kandemir@DAC01
Kandemir@DATE06
![Page 7: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/7.jpg)
- 7 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
Hierarchical memories: Memory hierarchy layer assignment (MHLA) (IMEC)
Partition n.1
Processor
Partition 1.1SPM-module 1.1.1
SPM-module 1.1.2
Partition 1.2Cache-module 1.2.1
Cache-module 1.2.2
Partition 2.1 Partition 2.2
n la
yers
with
"par
titio
ns"
cons
istin
g of
mod
ules …
[E. Brockmeyer et al.: Layer Assignment Techniques for Low Energy in Multi-Layered Memory Organisations, Design, Automation and Test in Europe (DATE), 2003.]
![Page 8: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/8.jpg)
- 8 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
Memory hierarchy layer assignment (MHLA)- Copy candidates -
int A[250]A'[0..99]=A[0..99];for (i=0; i<10; i++) for (j=0; j<10; j++) for (k=0; k<10; k++) for (l=0; l<10; l++) f(A'[j*10+l])size=100; reads(A)=100
int A[250]for (i=0; i<10; i++) {A'[0..99]=A[0..99]; for (j=0; j<10; j++) for (k=0; k<10; k++) for (l=0; l<10; l++) f(A'[j*10+l])}size=100;reads(A)=1000
int A[250]for (i=0; i<10; i++) for (j=0; j<10; j++) {A"[0..9]=A[j*10..j*10+9]; for (k=0; k<10; k++) for (l=0; l<10; l++) f(A"[l])}size=10; reads(A)=1000
int A[250]for (i=0; i<10; i++) for (j=0; j<10; j++) for (k=0; k<10; k++) for (l=0; l<10; l++) f(A[j*10+l])size=0; reads(A)=10000
10 100
100
1000
10000
size
reads(A)
A', A" in small memory
Copy candidate
![Page 9: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/9.jpg)
- 9 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
Memory hierarchy layer assignment (MHLA)- Goal -
Goal: For each variable: find permanent layer, partition and module & select copy candidates such that energy is minimized.
Conflicts between variables
[E. Brockmeyer et al.: Layer Assignment Techniques for Low Energy in Multi-Layered Memory Organisations, Design, Automation and Test in Europe (DATE), 2003.]
![Page 10: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/10.jpg)
- 10 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
Memory hierarchy layer assignment (MHLA)- Approach -
More general hardware architecture than the Dortmund approach, but no global optimization.
Approach: start with initial variable allocation incrementally improve initial solution
such that total energy is minimized.
NOT assignedcopy candidates
assignedcopy candidates Platform
A’
A
A”
L3
L2
L1L0
1250
11000
1000
10000
250
Current assignmentNOT assignedcopy candidates
assignedcopy candidate Platform
A’
A
A”
L3
L2
L1L0
1250
11000
1000
10000
250
Next assignment
100step 1350
step2
1100
100
![Page 11: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/11.jpg)
- 11 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
Dynamic set of multiple applications
MEM
CPU
SPM
SPMManager
App. 2
App. 1
App. n
SP
M
App. 3
App. 2
App. 1t
Address space:
SPM
?
Compile-time partitioning of SPM no longer feasibleIntroduction of SPM-managerRuntime decisions, but
compile-time supported
[R. Pyka, Ch. Faßbach, M. Verma, H. Falk, P. Marwedel: Operating system integrated energy aware scratchpad allocation strategies for multi-process applications, SCOPES, 2007]
![Page 12: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/12.jpg)
- 12 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
Approach overview
App. 2
App. 1
App. n AllocationManager
Standard Compiler
(GCC)
OperatingSystem
Compile-timeTransformations
Profit values / Allocation hints
2 steps: compile-time analysis & runtime decisions No need to know all applications at compile-time Capable of managing runtime allocated memory objects Integrates into an embedded operating system
Using MPArm simulator from U. Bologna
![Page 13: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/13.jpg)
- 13 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
Results
MEDIA+ Energy Baseline: Main memory only Best: Static for 16k 58% Overall best: Chunk 49%
MEDIA+ Cycles Baseline: Main memory only Best: Static for 16k 65% Overall best: Chunk 61%
![Page 14: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/14.jpg)
- 14 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
Comparison of SPMM to Caches for SORT
Baseline: Main memory onlySPMM peak energy reduction by
83% at 4k Bytes scratchpadCache peak: 75% at 2k 2-way
cache
SPMM capable ofoutperforming caches
OS and libraries are not considered yet
Chunk allocation results:
SPM Size Δ 4-way1024 74,81%
2048 65,35%
4096 64,39%
8192 65,64%
16384 63,73%
![Page 15: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/15.jpg)
- 15 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
SPM+MMU (1)
How to use SPM in a system with virtual addressing? Virtual SPM
Typically accesses MMU + SPM in parallel not energy efficient
Real SPM suffers from potentiallylong VA translation
Egger, Lee, Shin (Seoul Nat. U.):Introduction of small µTLB translating recent addresses fast.
[B. Egger, J. Lee, H. Shin: Scratchpad memory management for portable systems with a memory management unit, CASES, 2006, p. 321-330 (best paper)]
Proc. $SPM MMU
![Page 16: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/16.jpg)
- 16 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
SPM+MMU (2)
µTLB generates physical address in 1 cycle
if address corresponds to SPM, it is used
otherwise, mini-cache is accessed
Mini-cache provides reasonable performance for non-optimized code
µTLB miss triggers main TLB/MMU
SPM is used only for instructions instructions are stored in pages pages are classified as
cacheable, non-cacheable, and “pageable”(= suitable for SPM)
CPU core
MMU
unified TLB
minicache
TAG RAM
DATA RAMSPM
SPM base reg. comparator
µTLB
instruction VA
PA
![Page 17: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/17.jpg)
- 17 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
SPM+MMU (3)
Application binaries are modified: frequently executed code put into pageable pages.
Initially, page-table entries for pageable code are marked invalid
If invalid page is accessed, a page table exception invokes SPM manager (SPMM).
SPMM allocates space in SPM and sets page table entry
If SPMM detects more requests than fit into SPM, SPM eviction is started
Compiler does not need to know SPM size virtual memory
SPM
main memory
physical memory
stack/heapregion
pageable region
cached region
uncached region
pageable region
cached region
uncached region
PC
![Page 18: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/18.jpg)
- 18 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
Extension to SNACK-pop(post-pass optimization)
H. Cho, B. Egger, J. Lee, H. Shin: Dynamic Data Scratchpad Memory Management for a Memory Subsystem with an MMU, LCTES, 2007
objectfiles libraries
disassemble
profile code generation
building dynamiccall graph
cloning functions
inserting SPMmanager calls
executable image generation
ILP solver
dynamic call graph
profiling image
architecture simulator
profile data
executable image
input data
![Page 19: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/19.jpg)
- 19 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
Cloning of functions
q
m n
g
f
q
m n
g
f
g’
Computation of which block should be moved in and out for a certain edge
Generation of an ILP Decision about copy operations at compile time.
![Page 20: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/20.jpg)
- 20 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
Results for SNACK-pop (1)
© ACM, 2007
![Page 21: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/21.jpg)
- 21 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
Results for SNACK-pop (2)
© ACM, 2007
![Page 22: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/22.jpg)
- 22 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
Multi-processor ARM (MPARM) Framework
Homogenous SMP ~ CELL processor Processing Unit : ARM7T processor Shared Coherent Main Memory Private Memory: Scratchpad Memory
SPM SPMSPMSPM
InterruptDevice
SemaphoreDevice
ARM ARM ARM ARM
Interconnect (AMBA or STBus)
Shared Main Memory
![Page 23: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/23.jpg)
- 23 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
Application Example: Multi-Processor Edge Detection
Source SinkCompute Processors
Source, sink and n compute processors Each image is processed by an independent compute processor
• Communication overhead is minimized.
![Page 24: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/24.jpg)
- 24 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
Results:Scratchpad Overlay for Edge Detection
0
500
1000
1500
2000
2500
128 256 512 1024 2048 4096 8192 16384
Scratchpad Size per Processor (Bytes)
Tota
l Ene
rgy
Cons
umpt
ion
(mJ)
Energy (SO): 1 CP
Energy (SO): 2 CP
Energy (SO): 3 CP
Energy (SO): 4 CP
2 CPs are better than 1 CP, then energy consumption stabilizes Best scratchpad size: 4kB (1CP& 2CP) 8kB (3CP & 4CP)
![Page 25: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/25.jpg)
- 25 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
ResultsDES-Encryption
0
10
20
30
40
50
60
70
0 128 256 512 1k 2k 4k 8k 16k
Scratchpad Size [Bytes]
Ener
gy C
onsu
mpt
ion
[µJ]
Energy Consumption [µJ]
DES-Encryption: 4 processors: 2 Controllers+2 Compute Engines
Energy values from ST Microelectronics
Result of ongoing cooperation between U. Bologna and U. Dortmund supported by ARTIST2 network of excellence.
![Page 26: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/26.jpg)
- 26 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
MPSoC with shared SPMs
[M. Kandemir, I. Kadayif, A. Choudhary, J. Ramanujam, I. Kolcu: Compiler-Directed Scratch Pad Memory Optimization for Embedded Multiprocessors, IEEE Trans. on VLSI, Vol. 12, 2004, pp. 281-286]
© IEEE, 2004
![Page 27: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/27.jpg)
- 27 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
Energy benefits despite large latenciesfor remote SPMs
© IEEE, 2004
DRAM: 80 cycles
![Page 28: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/28.jpg)
- 28 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
Extensions
Using DRAM Applications to Flash memory
(copy code or execute in place):according to own experiments: very much parameter dependent
Trying to imitate advantages of SPM with caches: partitioned caches, etc.
PhD thesis of Lars Wehmeyer
![Page 29: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/29.jpg)
fakultät für informatikinformatik 12
technische universität dortmund
Optimizationsfor Caches
Peter MarwedelTU DortmundInformatik 12
Germany
2009/01/17
Gra
phic
s: ©
Ale
xand
ra N
olte
, Ges
ine
Mar
wede
l, 20
03
![Page 30: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/30.jpg)
- 30 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
Improving predictability for caches
Loop caches Mapping code to less used part(s) of the index space Cache locking/freezing Changing the memory allocation for code or data Mapping pieces of software to specific ways
Methods:- Generating appropriate way in software- Allocation of certain parts of the address space to a
specific way- Including way-identifiers in virtual to real-address
translation “Caches behave almost like a scratch pad”
![Page 31: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/31.jpg)
- 31 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
Code Layout Transformations (1)
Execution counts based approach: Sort the functions according to execution counts
f4 > f1 > f2 > f5 > f3
Place functions in decreasing order of execution counts
f1
f2
f3
f4
f5
(1100)
(900)
(400)
(2000)
(700)
[S. McFarling: Program optimization for instruction caches, 3rd International Conference on Architecture Support for Programming Languages and Operating Systems (ASPLOS), 1989]
![Page 32: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/32.jpg)
- 32 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
Code Layout Transformations (2)
Execution counts based approach: Sort the functions according to execution counts
f4 > f1 > f2 > f5 > f3
Place functions in decreasing order of execution counts
Transformation increases spatial locality.Does not take in account calling order
f4
(1100)
(700)
(400)
(2000)
(900)
f4
f1
f2
f5
f3
f4
f2 f5
f1 f3
![Page 33: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/33.jpg)
- 33 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
Code Layout Transformations (3)
Call-Graph Based Algorithm: Create weighted call-graph. Place functions according to weighted depth-first
traversal.f4 > f2 > f1 > f3 > f5
Increases spatial locality.
(2000) f4
f4
f2 f5
f1 f3 [W. W. Hwu et al.: Achieving high instruction cache performance with an optimizing compiler, 16th Annual International Symposium on Computer Architecture, 1989]
![Page 34: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/34.jpg)
- 34 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
Code Layout Transformations (3)
Call-Graph Based Algorithm: Create weighted call-graph. Place functions according to weighted depth-first
traversal.f4 > f2 > f1 > f3 > f5
Increases spatial locality.
(900)
(2000) f4
f2
f4
f2 f5
f1 f3
![Page 35: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/35.jpg)
- 35 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
Code Layout Transformations (4)
Call-Graph Based Algorithm: Create weighted call-graph. Place functions according to weighted depth-first
traversal.f4 > f2 > f1 > f3 > f5
Increases spatial locality.
(900)
(1100)
(2000) f4
f2
f1
f4
f2 f5
f1 f3
![Page 36: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/36.jpg)
- 36 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
Code Layout Transformations (5)
Call-Graph Based Algorithm: Create weighted call-graph. Place functions according to weighted depth-first
traversal.f4 > f2 > f1 > f3 > f5
Increases spatial locality.
f4
f2 f5
f1 f3
f4
(900)
(1100)
(400)
(2000) f4
f2
f1
f3
![Page 37: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/37.jpg)
- 37 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
Code Layout Transformations (6)
Call-Graph Based Algorithm: Create weighted call-graph. Place functions according to weighted depth-first
traversal.f4 > f2 > f1 > f3 > f5
Combined with placing frequently executed traces at the top of the code space for functions.
Increases spatial locality.f4
(900)
(1100)
(400)
(2000)
(700)
f4
f2
f5
f1
f3f4
f2 f5
f1 f3
![Page 38: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/38.jpg)
- 38 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
Way prediction/selective direct mapping
[M. D. Powell, A. Agarwal, T. N. Vijaykumar, B. Falsafi, K. Roy: Reducing Set-Associative Cache Energy via Way-Prediction and Selective Direct-Mapping, MICRO-34, 2001]
© ACM
![Page 39: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/39.jpg)
- 39 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
Hardware organization for way prediction
© ACM
![Page 40: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/40.jpg)
- 40 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
Results for the paper on way prediction (1)
0.06Tag array energy (incl. in the above numbers)
0.0071024x4bit prediction table read/write
0.24Cache write
0.211 way read
1.00Parallel access cache read (4 ways read)
Relative energy
Energy component
Cache energy andprediction overhead
© ACM
2-level hybridBranch predictor32LSQ size64Reorder buffer size
80 cycles+4 cycles per 8 bytes
Memory access latency
1M, 8-way, 12 cycle latency
L2 cache
16K, 4-way, 1 or 2 cycles, 2ports
Base L1 D-Cache
16K, 4-way, 1 cycle
L1 I-Cache
8 issues per cycleInstruction issue & decode bandwidth
System configuration parameters
![Page 41: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/41.jpg)
- 41 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
Results for the paper on way prediction (2)
© ACM
![Page 42: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/42.jpg)
- 42 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
Results for the paper on way prediction (2)
© ACM
![Page 43: Optimizations - TU Dortmund · Cache Kandemir@DAC01 Kandemir@DATE06. technische universität - 7 - dortmund fakultät für ... Layer Assignment Techniques for Low Energy in Multi-Layered](https://reader034.fdocuments.us/reader034/viewer/2022042211/5eb0ed42caf27c6b37246648/html5/thumbnails/43.jpg)
- 43 -technische universitätdortmund
fakultät für informatik
p. marwedel, informatik 12, 2009
TU Dortmund
Summary
Allocation strategies for SPM• Dynamic sets of processes• Multiprocessors• MMUs• Sharing between SPMs in a multi-processor
Optimizations for Caches• Code Layout transformations• Way prediction