A Code Layout Framework for Embedded Processors with Configurable Memory Hierarchy
Kaushal Sanghai and David Kaeli
ECE Department, Northeastern University, Boston, MA
Outline
- Motivation and goals
- Blackfin 53x memory architecture
- L1 code memory configurations
- Code layout algorithm
- PGO linker tool
- Methodology
- Results
- Conclusions and future work
- References
Motivation
Blackfin processor cores provide highly configurable memory subsystems to better match application-specific workload characteristics
Spatial and temporal locality present in applications should be exploited to produce efficient layouts
Code and data layout can be optimized by profile guidance
Motivation
Most developers rely on hand-tuning the layout, which not only increases the time-to-market of embedded products but also often results in an inefficient memory mapping
Program optimization techniques that automatically optimize the memory layout for such memory subsystems are therefore needed
Goals
- Develop a complete code-mapping framework that provides automatic code layout for the range of L1 memory configurations available on Blackfin
- Create tools that enable fast and easy design space exploration across the range of L1 memory configurations
- Utilize execution profiles to tune code layout
- Evaluate performance of the code mapping algorithms on the available L1 memory configurations for embedded multimedia applications
Memory Architecture
[Diagram: Core <-> L1 Instruction Memory (single-cycle access; configurable as SRAM, Cache, or SRAM/Cache) <-> optional L2 SRAM <-> external SDRAM, 4x(16 - 128 MB), reached in 10-12 system clock cycles]
[The same memory architecture diagram is repeated, highlighting each of the three L1 configurations in turn: L1 SRAM, L1 Cache, and L1 SRAM/Cache]
Tradeoffs Involved
L1 SRAM
- Most cache misses are avoided by mapping the most frequently executed (hot) or critical code sections into SRAM
- Performance can suffer if all of the hot code cannot be mapped to L1 SRAM
L1 Cache
- Exploits temporal locality in code
- May increase external memory bandwidth requirements
- Performance can suffer if the application has poor temporal locality
L1 SRAM/Cache
- Mapping hot sections into L1 SRAM reduces external memory bandwidth requirements
- Cache provides low-latency access to infrequent code
Code Layout Algorithms
Greedy algorithms implemented within the framework, by memory configuration:
- L1 SRAM: greedy sort to solve the knapsack problem
- L1 SRAM & Cache: greedy heuristics to solve the graph coarsening problem
L1 SRAM Layout - Why Knapsack?

Objects:                 A     B    C    D    E    F    G    H
Value (execution freq):  25    20   15   10   7    3    2    1
Weight (code size):      2     5    2    2    2    1    1    2
Value/Weight:            12.5  4    7.5  5    3.5  3    2    0.5

Weight bound (size of L1 SRAM space): 8

Algorithm       Objects       Total value in knapsack
Most Executed   A, B, F       48
Optimal         A, C, D, E    57
Greedy Sort     A, C, D, E    57
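The greedy sort above can be sketched as follows. This is an illustrative reimplementation using the table's values and weights, not the framework's actual code:

```python
def greedy_sram_layout(objects, capacity):
    """Greedy value/weight knapsack heuristic for L1 SRAM layout.

    objects: dict mapping name -> (value, weight), where value is
    execution frequency and weight is code size.
    Returns (chosen names, total value packed).
    """
    # Sort candidates by value density (value per unit of code size).
    order = sorted(objects,
                   key=lambda n: objects[n][0] / objects[n][1],
                   reverse=True)
    chosen, total_value, used = [], 0, 0
    for name in order:
        value, weight = objects[name]
        if used + weight <= capacity:   # fits in remaining SRAM space
            chosen.append(name)
            used += weight
            total_value += value
    return chosen, total_value

# The slide's example: 8 units of L1 SRAM space.
objects = {"A": (25, 2), "B": (20, 5), "C": (15, 2), "D": (10, 2),
           "E": (7, 2), "F": (3, 1), "G": (2, 1), "H": (1, 2)}
print(greedy_sram_layout(objects, 8))  # (['A', 'C', 'D', 'E'], 57)
```

On this input the greedy sort matches the optimal solution (value 57), whereas packing by raw execution frequency alone yields only 48.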
Efficient L1 SRAM Layout
Maximize \sum_{i=1}^{n} E(i)                              .....(1)

subject to \sum_{i=1}^{n} S(i) \le \text{L1 memory size}  .....(2)

where E(i) is the execution percentage of code section i relative to the entire execution, and S(i) is the size of code section i.

This is an NP-complete problem!
Efficient Cache Layout
[Hashemi & Kaeli'98]
- Nodes: functions
- Edge weight: calling frequency
- Each color represents a cache line
- Functions mapped to the same color conflict

[Weighted call graph figure: nodes A-H, edge weights 300, 200, 50, 50, 50, 30, 2]
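The "same color conflict" idea can be sketched as a cache-set overlap check. The line size and set count below are illustrative assumptions, not necessarily the Blackfin 53x's actual cache parameters:

```python
def cache_set(addr, line_size=32, num_sets=128):
    """Cache set ('color') an address maps to, for an assumed geometry."""
    return (addr // line_size) % num_sets

def functions_conflict(addr_a, size_a, addr_b, size_b,
                       line_size=32, num_sets=128):
    """True if two functions' code occupies any common cache set,
    i.e. they are mapped to the same color and can evict each other."""
    sets_a = {cache_set(a, line_size, num_sets)
              for a in range(addr_a, addr_a + size_a, line_size)}
    sets_b = {cache_set(b, line_size, num_sets)
              for b in range(addr_b, addr_b + size_b, line_size)}
    return bool(sets_a & sets_b)
```

With 32-byte lines and 128 sets the mapping wraps every 4 KB, so two 64-byte functions placed 4 KB apart conflict, while the same functions placed 128 bytes apart do not; a layout tool moves frequently interacting functions into non-overlapping sets.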
Efficient Cache Layout - Improved Mapping
[Hashemi & Kaeli'98]

[The same weighted call graph shown before and after remapping, recolored so that heavily connected functions no longer share a cache line]
Efficient L1 SRAM/Cache Layout
Partition code into sections to be placed in L1 SRAM and L1 Cache.
L1 SRAM mapping:
- Maximize the amount of execution from L1 SRAM
- Map functions with low temporal locality in L1 SRAM
- Solve the knapsack problem for all functions based on execution percentage, size, and temporal reuse distance
L1 Cache mapping:
- Of the remaining functions, merge frequently executed caller/callee function pairs and map them into contiguous memory locations
Algorithm Inputs
- Execution percentage and size
- Weighted call graph
- Temporal reuse distance (RUD) for every function
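The temporal reuse distance input can be derived from the function call trace. The definition below (number of distinct functions called between successive calls to the same function) is one plausible formulation; the framework's exact definition may differ:

```python
def reuse_distances(call_trace):
    """Average temporal reuse distance per function, from a call trace.

    call_trace: ordered list of function names as they were called.
    Distance for one reuse = number of distinct other functions called
    between two successive calls to the same function.
    """
    last_index, dists = {}, {}
    for i, fn in enumerate(call_trace):
        if fn in last_index:
            between = set(call_trace[last_index[fn] + 1 : i])
            dists.setdefault(fn, []).append(len(between))
        last_index[fn] = i
    return {fn: sum(d) / len(d) for fn, d in dists.items()}
```

A function with a small average reuse distance is well served by the cache; a hot function with a large reuse distance is a better candidate for SRAM, which matches the "map functions with low temporal locality in L1 SRAM" rule above.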
Algorithm
L1 SRAM mapping
Step 1: Filter out functions with less than 1% execution percentage
Step 2: Compute (Execution %/Size) /RUD for the remaining functions
Step 3: Solve the Knapsack problem and map the solution to the L1 SRAM space
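Steps 1-2 can be sketched as below; the field names and profile-record shape are hypothetical, chosen only to illustrate the ranking metric:

```python
def rank_sram_candidates(profiles):
    """Step 1: drop functions below 1% of total execution.
    Step 2: rank the rest by (execution% / size) / reuse distance,
    highest first; these are the knapsack candidates for Step 3.

    profiles: dict function -> {"exec_pct", "size", "rud"} (hypothetical).
    """
    kept = {f: p for f, p in profiles.items() if p["exec_pct"] >= 1.0}
    return sorted(
        kept,
        key=lambda f: (kept[f]["exec_pct"] / kept[f]["size"]) / kept[f]["rud"],
        reverse=True)
```

Dividing the value density by the reuse distance prioritizes hot, compact functions with poor temporal locality, exactly the code that benefits least from the cache and most from SRAM residence.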
Algorithm
L1 cache mapping
Step 4: Form the call graph of the remaining functions and sort by edge weights
Step 5: Set the threshold on max merged node size (MNsize); this is equal to the size of one way of the cache
Step 6: For all edges in the sorted list start merging nodes until merged node size <= MNsize
Algorithm

Step 7: Let A and B be the nodes connected by an edge, and SA and SB their corresponding sizes. There are 4 cases, based on the merged-node assignment of the two nodes:
- Case 1: neither A nor B belongs to a merged node, and SA + SB < MNsize: merge A and B and assign them a common merged-node id
- Case 2: A belongs to a merged node and B does not: if SA + SB < MNsize, merge B into A's merged node; else proceed to the next edge
- Case 3: B belongs to a merged node and A does not: same as case 2, with A and B swapped
- Case 4: A and B belong to different merged nodes: if the total size of the two merged nodes is less than MNsize, merge them to form a bigger node; else proceed to the next edge
Step 8: Map the resulting merged nodes into contiguous memory locations, starting with the merged node containing the heaviest edge
PGO Linker Framework

[Framework diagram, built up module by module:
Program application -> Read function symbol module -> Program instrumentation module (instruments the application) -> Gather profile information module -> Call trace processing module (produces the function call trace and temporal reuse distance) -> Code layout module (inputs: call graph, reuse distance, EP/size) -> Generate linker directive file -> Relink the application]
Methodology
Evaluated the algorithms on six consumer benchmark programs from the EEMBC suite:

Benchmark       Code Size (KB)   # of functions
JPEG2 encoder   56               380
JPEG2 decoder   61               388
MPEG2 encoder   84               330
MPEG2 decoder   68               351
MPEG4 encoder   197              480
MPEG4 decoder   131              404
Methodology
- Configured L1 memory as L1 SRAM/Cache for all the benchmarks
- All experiments are performed on the Blackfin 533 EZ-Kit hardware board
- 4 different L1 memory configurations considered:
  - 12K L1, divided as 8K SRAM and 4K Cache
  - 16K L1, divided as 12K SRAM and 4K Cache
  - 16K L1, divided as 8K SRAM and 8K Cache
  - 80K L1, divided as 64K SRAM and 16K Cache
Results

[Bar chart: % improvement in cycles over the baseline (8KS-4KC, no optimization) for each benchmark program, y-axis 0-40%, with one bar per layout: ME, KS, KS-TR, ME-GP, KS-GP, KS-TR-GP]
Results

[Bar chart: average % improvement in cycles across all benchmarks for each L1 SRAM/Cache size (8KS-4KC, 12KS-4KC, 8KS-8KC), y-axis 0-25%, with one bar per layout: ME, KS, KS-TR, ME-GC, KS-GC, KS-TR-GC]
Results

[Bar chart: performance in cycles (in 10 millions, y-axis 0-80) for each benchmark program under the 8KS-4KC, 12KS-4KC, 8KS-8KC, and 64KS-16KC configurations, each with No-Opt and Full-Opt layouts]
Enhanced System Implementation Cycle

[Flow diagram: System design -> Code development -> Debug successful -> Program optimization (compiler optimization and/or profile-guided compiler optimizations) -> Evaluate L1 memory configurations and size within the PGO linker framework]
Features of the framework
- Process is completely automated: gather profiles, generate dynamic function call graphs, run the optimization algorithms, and re-link the project for an improved layout
- Can be used with hardware, compiled simulation, or cycle-accurate simulation sessions in the VisualDSP++ development environment for BFxxx
- Code mapping at function-level granularity
- Efficient in run time
Conclusion
We have developed a completely automated and efficient code layout framework for the configurable L1 code memory supported by BFxxx processors
We show a minimum of 3% to a maximum of 33% performance improvement (20% on average) for the six benchmark programs with a 12K L1 memory
We show that by efficiently mapping code, a 16K L1 memory achieves performance similar to that of an 80K L1 memory
Future Work
- The mapping can be extended to basic-block granularity
- Code mapping to avoid external memory bank contention (SDRAM) can be incorporated
- Code layout techniques for multi-core architectures can be developed, considering shared memory accesses
- The framework can be extended to data layout techniques
References
- Kaushal Sanghai, David Kaeli, and Richard Gentile, "Code and Data Partitioning on Blackfin for Partitioned Multimedia Benchmark Programs," Proceedings of the 2005 Workshop on Optimizations for DSP and Embedded Systems, March 2005.
- Kaushal Sanghai, David Kaeli, Alex Raikman, and Ken Butler, "A Code Layout Framework for Configurable Memory Systems in Embedded Processors," General Technical Conference, Analog Devices Inc., June 2006.
Command Line Interface

PGOLinker <dxefile> <linker directive output file(.asm)> -multicore -algorithm

Sample Output

Algorithm Selected --> KNAPSACK
Connecting to the IDDE and loading Program
Connection to the IDDE established and Program loaded
Gathering the function symbol information
Function symbol information obtained
No existing profile session. A new profile session will be created
Application Running.
Processor Halted
Getting profile Information
Analyzing the profile information obtained
Analysis Done
Total sample count collected is --> 905
The total execution from L1 for 4KB of L1 is 98.232%
Total functions in L1: 14
--------------------------------------------------------------------------------
The total execution from L1 for 8KB of L1 is 100%
Total functions in L1: 22