A Code Layout Framework for Embedded Processors with Configurable Memory Hierarchy
Kaushal Sanghai and David Kaeli
ECE Department, Northeastern University, Boston, MA
Outline
- Motivation and goals
- Blackfin 53x memory architecture
- L1 code memory configurations
- Code layout algorithm
- PGO linker tool
- Methodology
- Results
- Conclusions and future work
- References
Motivation
Blackfin processor cores provide highly configurable memory subsystems to better match application-specific workload characteristics
Spatial and temporal locality present in applications should be exploited to produce efficient layouts
Code and data layout can be optimized by profile guidance
Motivation
Most developers rely on hand-tuning the layout, which not only increases the time-to-market of embedded products but also often results in an inefficient memory mapping
Program optimization techniques that automatically optimize the memory layout for such memory subsystems are therefore needed
Goals
- Develop a complete code-mapping framework that provides automatic code layout for the range of L1 memory configurations available on Blackfin
- Create tools that enable fast and easy design space exploration across the range of L1 memory configurations
- Utilize execution profiles to tune code layout
- Evaluate performance of the code mapping algorithms on the available L1 memory configurations for embedded multimedia applications
Memory Architecture
[Diagram: Core <-> L1 Instruction Memory (single-cycle access; configurable as SRAM, Cache, or SRAM/Cache) <-> optional L2 SRAM <-> external SDRAM, 4x(16 - 128 MB), reached in 10-12 system clock cycles]
[The same memory architecture diagram is repeated, highlighting each of the three L1 configurations in turn: L1 SRAM, L1 Cache, and L1 SRAM/Cache]
Tradeoffs Involved
L1 SRAM
- Most cache misses are avoided by mapping the most frequently executed (hot) or critical code sections into SRAM
- Performance can suffer if all of the hot code cannot be mapped to L1 SRAM
L1 Cache
- Exploits temporal locality in code
- May increase external memory bandwidth requirements
- Performance can suffer if the application has poor temporal locality
L1 SRAM/Cache
- Mapping hot sections into L1 SRAM reduces external memory bandwidth requirements
- Cache provides low-latency access to infrequent code
Code Layout Algorithms
Greedy algorithms implemented within the framework, by memory configuration:
- L1 SRAM: greedy sort to solve the knapsack problem
- L1 SRAM & Cache: greedy heuristics to solve the graph coarsening problem
L1 SRAM Layout - Why Knapsack?

Objects:                 A     B    C    D    E    F    G    H
Value (execution freq):  25    20   15   10   7    3    2    1
Weight (code size):      2     5    2    2    2    1    1    2
Value/Weight:            12.5  4    7.5  5    3.5  3    2    0.5

Weight bound (size of L1 SRAM space): 8

Algorithm       Objects       Total value in knapsack
Most Executed   A, B, F       48
Optimal         A, C, D, E    57
Greedy Sort     A, C, D, E    57
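The greedy sort above can be sketched as follows. This is an illustrative reimplementation using the table's values and weights, not the framework's actual code:

```python
def greedy_sram_layout(objects, capacity):
    """Greedy value/weight knapsack heuristic for L1 SRAM layout.

    objects: dict mapping name -> (value, weight), where value is
    execution frequency and weight is code size.
    Returns (chosen names, total value packed).
    """
    # Sort candidates by value density (value per unit of code size).
    order = sorted(objects,
                   key=lambda n: objects[n][0] / objects[n][1],
                   reverse=True)
    chosen, total_value, used = [], 0, 0
    for name in order:
        value, weight = objects[name]
        if used + weight <= capacity:   # fits in remaining SRAM space
            chosen.append(name)
            used += weight
            total_value += value
    return chosen, total_value

# The slide's example: 8 units of L1 SRAM space.
objects = {"A": (25, 2), "B": (20, 5), "C": (15, 2), "D": (10, 2),
           "E": (7, 2), "F": (3, 1), "G": (2, 1), "H": (1, 2)}
print(greedy_sram_layout(objects, 8))  # (['A', 'C', 'D', 'E'], 57)
```

On this input the greedy sort matches the optimal solution (value 57), whereas packing by raw execution frequency alone yields only 48.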
Efficient L1 SRAM Layout
Maximize \sum_{i=1}^{n} E(i)                              .....(1)

subject to \sum_{i=1}^{n} S(i) \le \text{L1 memory size}  .....(2)

where E(i) is the execution percentage of code section i relative to the entire execution, and S(i) is the size of code section i.

This is an NP-complete problem!
Efficient Cache Layout
[Hashemi & Kaeli'98]
- Nodes: functions
- Edge weight: calling frequency
- Each color represents a cache line
- Functions mapped to the same color conflict

[Weighted call graph figure: nodes A-H, edge weights 300, 200, 50, 50, 50, 30, 2]
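The "same color conflict" idea can be sketched as a cache-set overlap check. The line size and set count below are illustrative assumptions, not necessarily the Blackfin 53x's actual cache parameters:

```python
def cache_set(addr, line_size=32, num_sets=128):
    """Cache set ('color') an address maps to, for an assumed geometry."""
    return (addr // line_size) % num_sets

def functions_conflict(addr_a, size_a, addr_b, size_b,
                       line_size=32, num_sets=128):
    """True if two functions' code occupies any common cache set,
    i.e. they are mapped to the same color and can evict each other."""
    sets_a = {cache_set(a, line_size, num_sets)
              for a in range(addr_a, addr_a + size_a, line_size)}
    sets_b = {cache_set(b, line_size, num_sets)
              for b in range(addr_b, addr_b + size_b, line_size)}
    return bool(sets_a & sets_b)
```

With 32-byte lines and 128 sets the mapping wraps every 4 KB, so two 64-byte functions placed 4 KB apart conflict, while the same functions placed 128 bytes apart do not; a layout tool moves frequently interacting functions into non-overlapping sets.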
Efficient Cache Layout - Improved Mapping
[Hashemi & Kaeli'98]

[The same weighted call graph shown before and after remapping, recolored so that heavily connected functions no longer share a cache line]
Efficient L1 SRAM/Cache Layout
Partition code into sections to be placed in L1 SRAM and L1 Cache.
L1 SRAM mapping:
- Maximize the amount of execution from L1 SRAM
- Map functions with low temporal locality in L1 SRAM
- Solve the knapsack problem for all functions based on execution percentage, size, and temporal reuse distance
L1 Cache mapping:
- Of the remaining functions, merge frequently executed caller/callee function pairs and map them into contiguous memory locations
Algorithm Inputs
- Execution percentage and size
- Weighted call graph
- Temporal reuse distance (RUD) for every function
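The temporal reuse distance input can be derived from the function call trace. The definition below (number of distinct functions called between successive calls to the same function) is one plausible formulation; the framework's exact definition may differ:

```python
def reuse_distances(call_trace):
    """Average temporal reuse distance per function, from a call trace.

    call_trace: ordered list of function names as they were called.
    Distance for one reuse = number of distinct other functions called
    between two successive calls to the same function.
    """
    last_index, dists = {}, {}
    for i, fn in enumerate(call_trace):
        if fn in last_index:
            between = set(call_trace[last_index[fn] + 1 : i])
            dists.setdefault(fn, []).append(len(between))
        last_index[fn] = i
    return {fn: sum(d) / len(d) for fn, d in dists.items()}
```

A function with a small average reuse distance is well served by the cache; a hot function with a large reuse distance is a better candidate for SRAM, which matches the "map functions with low temporal locality in L1 SRAM" rule above.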
Algorithm
L1 SRAM mapping
Step 1: Filter out functions with less than 1% execution percentage
Step 2: Compute (Execution %/Size) /RUD for the remaining functions
Step 3: Solve the Knapsack problem and map the solution to the L1 SRAM space
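Steps 1-2 can be sketched as below; the field names and profile-record shape are hypothetical, chosen only to illustrate the ranking metric:

```python
def rank_sram_candidates(profiles):
    """Step 1: drop functions below 1% of total execution.
    Step 2: rank the rest by (execution% / size) / reuse distance,
    highest first; these are the knapsack candidates for Step 3.

    profiles: dict function -> {"exec_pct", "size", "rud"} (hypothetical).
    """
    kept = {f: p for f, p in profiles.items() if p["exec_pct"] >= 1.0}
    return sorted(
        kept,
        key=lambda f: (kept[f]["exec_pct"] / kept[f]["size"]) / kept[f]["rud"],
        reverse=True)
```

Dividing the value density by the reuse distance prioritizes hot, compact functions with poor temporal locality, exactly the code that benefits least from the cache and most from SRAM residence.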
Algorithm
L1 cache mapping
Step 4: Form the call graph of the remaining functions and sort by edge weights
Step 5: Set the threshold on max merged node size (MNsize); this is equal to the size of one way of the cache
Step 6: For all edges in the sorted list start merging nodes until merged node size <= MNsize
Algorithm

Step 7: Let A and B be the nodes connected by an edge, and SA and SB their corresponding sizes. There are 4 cases, based on the merged-node assignment of the two nodes:
- Case 1: neither A nor B belongs to a merged node, and SA + SB < MNsize: merge A and B and assign them a common merged-node id
- Case 2: A belongs to a merged node and B does not: if SA + SB < MNsize, merge B into A's merged node; else proceed to the next edge
- Case 3: B belongs to a merged node and A does not: same as case 2, with A and B swapped
- Case 4: A and B belong to different merged nodes: if the total size of the two merged nodes is less than MNsize, merge them to form a bigger node; else proceed to the next edge
Step 8: Map the resulting merged nodes into contiguous memory locations, starting with the merged node containing the heaviest edge
PGO Linker Framework

[Framework diagram, built up module by module:
Program application -> Read function symbol module -> Program instrumentation module (instruments the application) -> Gather profile information module -> Call trace processing module (produces the function call trace and temporal reuse distance) -> Code layout module (inputs: call graph, reuse distance, EP/size) -> Generate linker directive file -> Relink the application]
Methodology
Evaluated the algorithms on six consumer benchmark programs from the EEMBC suite:

Benchmark       Code Size (KB)   # of functions
JPEG2 encoder   56               380
JPEG2 decoder   61               388
MPEG2 encoder   84               330
MPEG2 decoder   68               351
MPEG4 encoder   197              480
MPEG4 decoder   131              404
Methodology
- Configured L1 memory as L1 SRAM/Cache for all the benchmarks
- All experiments are performed on the Blackfin 533 EZ-Kit hardware board
- 4 different L1 memory configurations considered:
  - 12K L1, divided as 8K SRAM and 4K Cache
  - 16K L1, divided as 12K SRAM and 4K Cache
  - 16K L1, divided as 8K SRAM and 8K Cache
  - 80K L1, divided as 64K SRAM and 16K Cache
Results

[Bar chart: % improvement in cycles over the baseline (8KS-4KC, no optimization) for each benchmark program, y-axis 0-40%, with one bar per layout: ME, KS, KS-TR, ME-GP, KS-GP, KS-TR-GP]
Results

[Bar chart: average % improvement in cycles across all benchmarks for each L1 SRAM/Cache size (8KS-4KC, 12KS-4KC, 8KS-8KC), y-axis 0-25%, with one bar per layout: ME, KS, KS-TR, ME-GC, KS-GC, KS-TR-GC]
Results

[Bar chart: performance in cycles (in 10 millions, y-axis 0-80) for each benchmark program under the 8KS-4KC, 12KS-4KC, 8KS-8KC, and 64KS-16KC configurations, each with No-Opt and Full-Opt layouts]
Enhanced System Implementation Cycle

[Flow diagram: System design -> Code development -> Debug successful -> Program optimization (compiler optimization and/or profile-guided compiler optimizations) -> Evaluate L1 memory configurations and size within the PGO linker framework]
Features of the framework
- Process is completely automated: gather profiles, generate dynamic function call graphs, run the optimization algorithms, and re-link the project for an improved layout
- Can be used with hardware, compiled simulation, or cycle-accurate simulation sessions in the VisualDSP++ development environment for BFxxx
- Code mapping at function-level granularity
- Efficient in run time
Conclusion
We have developed a completely automated and efficient code layout framework for the configurable L1 code memory supported by BFxxx processors
We show a minimum of 3% to a maximum of 33% performance improvement (20% on average) for the six benchmark programs with a 12K L1 memory
We show that by efficiently mapping code, a 16K L1 memory achieves performance similar to that of an 80K L1 memory
Future Work
- The mapping can be extended to basic-block granularity
- Code mapping to avoid external memory bank contention (SDRAM) can be incorporated
- Code layout techniques for multi-core architectures can be developed, considering shared memory accesses
- The framework can be extended to data layout techniques
References
- Kaushal Sanghai, David Kaeli, and Richard Gentile, "Code and Data Partitioning on Blackfin for Partitioned Multimedia Benchmark Programs," Proceedings of the 2005 Workshop on Optimizations for DSP and Embedded Systems, March 2005.
- Kaushal Sanghai, David Kaeli, Alex Raikman, and Ken Butler, "A Code Layout Framework for Configurable Memory Systems in Embedded Processors," General Technical Conference, Analog Devices Inc., June 2006.
Command Line Interface

PGOLinker <dxefile> <linker directive output file(.asm)> -multicore -algorithm

Sample Output

Algorithm Selected --> KNAPSACK
Connecting to the IDDE and loading Program
Connection to the IDDE established and Program loaded
Gathering the function symbol information
Function symbol information obtained
No existing profile session. A new profile session will be created
Application Running.
Processor Halted
Getting profile Information
Analyzing the profile information obtained
Analysis Done
Total sample count collected is --> 905
The total execution from L1 for 4KB of L1 is 98.232%
Total functions in L1: 14
--------------------------------------------------------------------------------
The total execution from L1 for 8KB of L1 is 100%
Total functions in L1: 22