Scratch Pad
![Page 1: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/1.jpg)
Outline
• Introduction
• Different Scratch Pad Memories
• Cache and Scratch Pad for embedded applications
![Page 2: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/2.jpg)
Memories in Embedded Systems
• Each memory has its own advantages
• For better performance, memory accesses have to be fast
CPU
Internal ROM
Internal SRAM
External DRAM
![Page 3: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/3.jpg)
Efficient Utilization of Scratch-Pad Memory in Embedded Processor Applications
![Page 4: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/4.jpg)
What is Scratchpad memory?
• Fast on-chip SRAM
• Abbreviated as SPM
• 2 types of SPM:
  • Static: SPM locations don’t change at runtime
  • Dynamic: SPM locations change at runtime
![Page 5: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/5.jpg)
Objective
• Find a technique for efficiently exploiting on-chip SPM by partitioning the application’s scalar and array variables into off-chip DRAM and on-chip SPM.
• Minimize the total execution time of the application.
![Page 6: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/6.jpg)
SPM and Cache
• Similarities
  • Connected to the same address and data buses.
  • Access latency of 1 processor cycle.
• Difference
  • SPM guarantees single-cycle access time, while an access to cache is subject to a miss.
![Page 7: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/7.jpg)
Block Diagram of Embedded Processor Application
![Page 8: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/8.jpg)
Division of Data Address Space between SRAM and DRAM
![Page 9: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/9.jpg)
Example: Histogram Evaluation Code
• Builds a histogram of 256 brightness levels for the pixels of an N × N image:

    unsigned char BrightnessLevel[512][512]; /* unsigned, so every level is a valid index 0..255 */
    int Hist[256]; /* Elements initialized to 0 */
    …
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) { /* For each pixel (i, j) in image */
            level = BrightnessLevel[i][j];
            Hist[level] = Hist[level] + 1;
        }
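The slide's loop can be packaged as a self-contained function. This is a minimal sketch: the helper name `build_histogram`, the `LEVELS` constant and the choice to pass the image as a flat row-major buffer are assumptions made for illustration and do not come from the slides.

```c
#include <stddef.h>
#include <string.h>

#define LEVELS 256   /* number of brightness levels, as on the slide */

/* Hypothetical helper: counts how often each brightness level occurs
 * in an n x n image stored row by row in `image`. */
void build_histogram(const unsigned char *image, size_t n, int hist[LEVELS])
{
    memset(hist, 0, LEVELS * sizeof hist[0]);   /* all bins start at 0 */
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {        /* for each pixel (i, j) */
            unsigned char level = image[i * n + j];
            hist[level] = hist[level] + 1;
        }
}
```

Note the access pattern: the image is read sequentially, while `hist` is indexed by pixel value and is therefore touched in a data-dependent order.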
![Page 10: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/10.jpg)
Problem Description
• If the code is executed on a processor configured with a data cache of size 1 KB, performance will be degraded by conflict misses in the cache between elements of the two arrays Hist and BrightnessLevel.
• Solution: selectively map to SPM those variables that cause the maximum number of conflicts in the data cache.
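To make the conflict-miss argument concrete, the sketch below simulates a direct-mapped 1 KB data cache for the histogram loop and counts misses with and without Hist mapped to SPM. The 16-byte line size, the base addresses and every name in the code are assumptions chosen for illustration; the slides only give the cache size.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_BYTES 1024u   /* cache size from the slide */
#define LINE_BYTES  16u     /* assumed line size */
#define NUM_LINES   (CACHE_BYTES / LINE_BYTES)

/* Assumed placement of the two arrays in the data address space. */
#define IMAGE_BASE 0x10000u
#define HIST_BASE  0x80000u

typedef struct {
    uint32_t      tag[NUM_LINES];
    int           valid[NUM_LINES];
    unsigned long misses;
} Cache;

/* One access to a direct-mapped cache; counts a miss and refills on mismatch. */
static void cache_access(Cache *c, uint32_t addr)
{
    uint32_t line = addr / LINE_BYTES;
    uint32_t set  = line % NUM_LINES;
    if (!c->valid[set] || c->tag[set] != line) {
        c->valid[set] = 1;
        c->tag[set]   = line;
        c->misses++;
    }
}

/* Runs the histogram access pattern over an n x n image and returns the
 * number of cache misses. When hist_in_spm is nonzero, Hist accesses are
 * served by the scratch pad and never touch the cache. */
unsigned long histogram_misses(const unsigned char *image, size_t n, int hist_in_spm)
{
    Cache c = {0};
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            size_t idx = i * n + j;
            cache_access(&c, IMAGE_BASE + (uint32_t)idx);
            if (!hist_in_spm)   /* read and write hit the same line: count once */
                cache_access(&c, HIST_BASE + 4u * image[idx]);
        }
    return c.misses;
}
```

With Hist in the scratch pad, the only misses left are the compulsory ones from streaming through the image, one per cache line. With Hist in the cache, its data-dependent accesses can evict image lines and vice versa.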
![Page 11: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/11.jpg)
Partitioning Strategy
• Features affecting partitioning
  • Scalar variables and constants
  • Size of arrays
  • Lifetimes of array variables
  • Access frequency of array variables
  • Conflicts in loops
• Partitioning Algorithm
![Page 12: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/12.jpg)
Features affecting partitioning
• Scalar variables and constants
  • All scalar variables and scalar constants are mapped onto SPM.
• Size of arrays
  • Arrays that are larger than SRAM are mapped onto off-chip memory.
![Page 13: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/13.jpg)
Features affecting partitioning
• Lifetime of an array variable
  • Definition: the period between its definition and its last use.
  • Variables with disjoint lifetimes can be stored in the same processor register.
  • Likewise, arrays with disjoint lifetimes can share the same memory space.
![Page 14: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/14.jpg)
Features affecting partitioning
• Intersecting Life Times, ILT(u)
  • Definition: number of array variables having a non-null intersection of lifetimes with u.
  • Indicates the number of other arrays that u could possibly interact with in the cache.
  • So map the arrays with the highest ILT values into SPM, thereby eliminating a large number of potential conflicts.
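ILT(u) can be computed directly from lifetime intervals. The following is a small sketch under the assumption that each lifetime is a closed interval of program points; the `Lifetime` type and the function names are hypothetical.

```c
#include <stddef.h>

typedef struct {
    int first_use;   /* program point of the definition */
    int last_use;    /* program point of the last use */
} Lifetime;

/* Two closed intervals intersect when neither ends before the other starts. */
static int lifetimes_intersect(Lifetime a, Lifetime b)
{
    return a.first_use <= b.last_use && b.first_use <= a.last_use;
}

/* ILT(u): number of other arrays whose lifetime intersects that of arrays[u]. */
int ilt(const Lifetime *arrays, size_t count, size_t u)
{
    int n = 0;
    for (size_t v = 0; v < count; v++)
        if (v != u && lifetimes_intersect(arrays[u], arrays[v]))
            n++;
    return n;
}
```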
![Page 15: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/15.jpg)
Features affecting partitioning
• Access frequency of array variables
  • Variable Access Count, VAC(u). Definition: number of accesses to elements of u during its lifetime.
  • Interference Access Count, IAC(u). Definition: number of accesses to other arrays during the lifetime of u.
  • Interference Factor: IF(u) = VAC(u) × IAC(u)
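The three quantities can be derived from an access trace. This sketch assumes the trace is a list of array identifiers in program order and approximates the lifetime of u as the span from its first to its last access; both the trace representation and the names are assumptions, not part of the slides.

```c
#include <stddef.h>

typedef struct {
    long vac;    /* VAC(u): accesses to u during its lifetime */
    long iac;    /* IAC(u): accesses to other arrays during u's lifetime */
    long ifac;   /* IF(u) = VAC(u) * IAC(u) */
} Interference;

/* trace[t] holds the id of the array accessed at step t. */
Interference interference(const int *trace, size_t len, int u)
{
    Interference r = {0, 0, 0};
    size_t first = len, last = 0;

    for (size_t t = 0; t < len; t++)
        if (trace[t] == u) {
            if (first == len)
                first = t;          /* first access opens the lifetime */
            last = t;               /* latest access closes it */
            r.vac++;
        }
    if (first == len)
        return r;                   /* u is never accessed */

    for (size_t t = first; t <= last; t++)
        if (trace[t] != u)
            r.iac++;
    r.ifac = r.vac * r.iac;
    return r;
}
```

An array that is accessed often while many accesses to other arrays are interleaved gets a high IF, which marks it as a likely source of cache interference.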
![Page 16: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/16.jpg)
Features affecting partitioning
• Conflicts in loops

    for i = 0 to N-1
        access a[i]
        access b[i]
        access c[2i]
        access c[2i + 1]
    end for

• Loop Conflict Graph (LCG): edge weight $e(u, v) = \sum_{i=1}^{p} k_i$, where $k_i$ is the total number of accesses to u and v in loop i.
• Total number of accesses to a and c combined: (1 + 2) × N = 3N, so e(a, c) = 3N; e(b, c) = 3N; e(a, b) = 0.
[Figure: loop conflict graph with nodes a, b and c; the edges a-c and b-c each have weight 3N]
![Page 17: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/17.jpg)
Features affecting partitioning
• Loop Conflict Factor
  • Definition: sum of the weights of the edges incident on node u.
  • $LCF(u) = \sum_{v \in LCG - \{u\}} e(u, v)$
  • The higher the LCF, the more conflicts are likely for an array, and the more desirable it is to map the array to the SPM.
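Given the edge weights of a loop conflict graph, LCF is a row sum. The sketch below assumes the graph is stored as a flat, symmetric n × n weight matrix; that representation and the function name are illustrative choices.

```c
#include <stddef.h>

/* LCF(u): sum of the weights of all edges incident on node u.
 * `e` is a flat n x n symmetric matrix with e[u * n + v] = e(u, v). */
long lcf(const long *e, size_t n, size_t u)
{
    long sum = 0;
    for (size_t v = 0; v < n; v++)
        if (v != u)
            sum += e[u * n + v];
    return sum;
}
```

For the three-array loop example with e(a, c) = e(b, c) = 3N and e(a, b) = 0, this gives LCF(a) = LCF(b) = 3N and LCF(c) = 6N, so c is the most attractive candidate for the SPM.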
![Page 18: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/18.jpg)
Partitioning Strategy
• Features affecting partitioning
  • Scalar variables and constants
  • Size of arrays
  • Lifetimes of array variables
  • Access frequency of array variables
  • Conflicts in loops
• Partitioning Algorithm
![Page 19: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/19.jpg)
Partitioning Algorithm
• Algorithm for determining the mapping decision of each (scalar and array) program variable to SPM or DRAM/cache.
• First assigns scalar constants and variables to SPM.
• Arrays that are larger than SPM are mapped onto DRAM.
![Page 20: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/20.jpg)
Partitioning Algorithm
• For the remaining (n) arrays, generates lifetime intervals and computes LCF and IF values.
• Sorts the 2n interval points thus generated and traverses them in increasing order.
• For each array u encountered, if there is sufficient SRAM space for u and for all arrays with lifetimes intersecting the lifetime interval of u that have more critical LCF and IF numbers, then maps u to SPM, else to DRAM/cache.
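The admission test in the last bullet can be sketched as follows. This is a simplified reading of the slide, not the authors' implementation: it assumes "more critical" means a higher LCF with IF as the tie-breaker (and array position as a final tie-breaker), and because each decision in this version depends only on lifetimes and criticality, it does not need the sorted traversal of interval points. All type and function names are hypothetical.

```c
#include <stddef.h>

typedef struct {
    size_t size;         /* bytes */
    int    start, end;   /* lifetime interval */
    long   lcf;          /* loop conflict factor */
    long   ifac;         /* interference factor */
    int    in_spm;       /* output: 1 = SPM, 0 = DRAM/cache */
} Array;

static int intersects(const Array *a, const Array *b)
{
    return a->start <= b->end && b->start <= a->end;
}

/* Assumed ordering: LCF first, then IF, then position in the input array. */
static int more_critical(const Array *a, const Array *b)
{
    if (a->lcf != b->lcf)
        return a->lcf > b->lcf;
    if (a->ifac != b->ifac)
        return a->ifac > b->ifac;
    return a < b;
}

/* Maps u to SPM only if u fits together with every more critical array
 * whose lifetime intersects that of u. */
void partition(Array *arr, size_t n, size_t spm_bytes)
{
    for (size_t u = 0; u < n; u++) {
        size_t needed = arr[u].size;
        for (size_t v = 0; v < n; v++)
            if (v != u && intersects(&arr[u], &arr[v]) &&
                more_critical(&arr[v], &arr[u]))
                needed += arr[v].size;
        arr[u].in_spm = needed <= spm_bytes;
    }
}
```

Because the ordering is strict, the least critical array admitted at any program point has already accounted for every other admitted array alive at that point, so the SPM capacity is never exceeded in this sketch.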
![Page 21: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/21.jpg)
Performance Details for Beamformer Example
![Page 22: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/22.jpg)
Typical Applications
• Dequant: de-quantization routine in MPEG decoder application
• IDCT: Inverse Discrete Cosine Transform
• SOR: Successive Over-Relaxation algorithm
• MatrixMult: matrix multiplication
• FFT: Fast Fourier Transform
• DHRC: Differential Heat Release Computation algorithm
![Page 23: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/23.jpg)
Performance Comparison of Configurations A, B, C and D
![Page 24: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/24.jpg)
Conclusion
• Average improvement of 31.4% over A (only SRAM)
• Average improvement of 30.0% over B (only cache)
• Average improvement of 33.1% over C (random partitioning)
![Page 25: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/25.jpg)
Compiler Decided Dynamic Memory Allocation for Scratch Pad Based Embedded Systems
![Page 26: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/26.jpg)
Cache is one of the options for on-chip memory
CPU Internal ROM
External DRAM
Cache
![Page 27: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/27.jpg)
Why Not All Embedded Systems Have Cache Memory
The reasons could be
• Increased on-chip area
• Increased energy
• Increased cost
• Hit latency and nondeterministic cache access
![Page 28: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/28.jpg)
A method for allocating program data to non-cached SRAM
• Dynamic, i.e. allocation changes at runtime
• Compiler-decided transfers
• Zero overhead per memory instruction, unlike software or hardware caching
• Has no software caching tags
• Requires no run-time checks
• Highly predictable memory access times
![Page 29: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/29.jpg)
Static Approach
    int a[100];
    int b[100];
    …
    while (i < 100) … a …
    while (i < 100) … b …

Allocator
External DRAM
Internal SRAM
int b[100]
![Page 30: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/30.jpg)
Static Approach
    int a[100];
    int b[100];
    …
    while (i < 100) … a …
    while (i < 100) … b …

Allocator
External DRAM
Internal SRAM
int a[100]
int b[100]
![Page 31: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/31.jpg)
Dynamic Approach
    int a[100];
    int b[100];
    …
    while (i < 100) … a …
    while (i < 100) … b …

Allocator
External DRAM
Internal SRAM
int a[100]
int b[100]
![Page 32: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/32.jpg)
Dynamic Approach
    int a[100];
    int b[100];
    while (i < 100) … a …
    while (i < 100) … b …

Allocator
External DRAM
Internal SRAM
int b[100]
int a[100]
It is similar to caching, but under compiler control
![Page 33: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/33.jpg)
Compiler-Decided Dynamic Approach
    int a[100];
    int b[100];
    …
    // a is in SRAM
    while (i < 100) … a …
    // Copy a out to DRAM
    // Copy b in to SRAM
    while (i < 100) … b …

Decide on dynamic behavior statically
• Need to minimize transfer costs for greater benefit
• Accounts for changing program requirements at run time
• Compiler manages and decides the transfers between SRAM and DRAM
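The copy-out and copy-in points in the slide's code can be simulated on a host machine with ordinary buffers. In this sketch the `sram` buffer stands in for the scratch pad, `memcpy` stands in for the compiler-inserted transfer code, and the loop bodies (increment a, sum b) are invented placeholders for the elided "… a …" and "… b …" work.

```c
#include <stddef.h>
#include <string.h>

/* dram_a and dram_b are the arrays' home locations; sram is a buffer
 * large enough for one of them. Returns a checksum of the work done. */
long run_regions(int *dram_a, const int *dram_b, int *sram, size_t n)
{
    long total = 0;

    memcpy(sram, dram_a, n * sizeof *sram);   /* a is in SRAM */
    for (size_t i = 0; i < n; i++) {          /* region 1: ... a ... */
        sram[i] += 1;
        total += sram[i];
    }

    memcpy(dram_a, sram, n * sizeof *sram);   /* copy a out to DRAM */
    memcpy(sram, dram_b, n * sizeof *sram);   /* copy b in to SRAM */
    for (size_t i = 0; i < n; i++)            /* region 2: ... b ... */
        total += sram[i];
    /* b was only read here, so it would need no write-back */

    return total;
}
```

Both loops access only the fast buffer; the price is the two block copies at the region boundary, which is the transfer cost the compiler has to weigh.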
![Page 34: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/34.jpg)
Approach
The method is to
• Use profiling to estimate reuse
• Copy variables into SRAM when reused
• Cost model ensures that benefit exceeds cost
• Transfers data between the on-chip and off-chip memory under compiler supervision
• Compiler-known data allocation at each point in the code
![Page 35: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/35.jpg)
Advantages
• Benefits with no software translation overhead
• Predictable SRAM accesses, ensuring better real-time guarantees than hardware or software caching
• No more data transfers than caching
![Page 36: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/36.jpg)
Overview of Strategy
Divide the complete program into different regions
For (starting point of each region) <
    Remove some variables from SRAM
    Copy some variables into SRAM from DRAM
>
![Page 37: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/37.jpg)
Some Important Questions
• What are regions?
• What to bring in to SRAM?
• What to evict from SRAM?
The problem has an exponential number of solutions (NP-complete)
![Page 38: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/38.jpg)
Regions
• A region is the code between successive program points
• Regions coincide with changes in program behavior
• New regions start at:
  • Start of each procedure
  • Before start of each loop
  • Before conditional statements containing loops or procedures
![Page 39: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/39.jpg)
What to Bring in to SRAM?
• Bring in variables that are re-used in the region, provided the cost of transfer is recovered.
• These transfers will reduce the memory access time.
• Cost model accounts for:
  • Profile-estimated re-use
  • Benefit from re-use
  • Detailed cost of transfer
    • Bring-in cost
    • Eviction cost
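A cost model of this shape might look like the sketch below. The formulas are assumptions made for illustration: benefit is taken as the profiled reuse count times the latency saved per access, and cost as a per-word charge for bringing the variable in plus, if it was modified, writing it back on eviction. None of the names or parameters come from the slides.

```c
/* All quantities are in cycles; every field and parameter is illustrative. */
typedef struct {
    long reuse_count;   /* profile-estimated accesses within the region */
    long size_words;    /* words that have to be transferred */
    int  written;       /* 1 if modified in SRAM, so eviction needs a write-back */
} Candidate;

long transfer_benefit(const Candidate *c, long dram_latency, long sram_latency)
{
    return c->reuse_count * (dram_latency - sram_latency);
}

long transfer_cost(const Candidate *c, long cycles_per_word)
{
    long bring_in = c->size_words * cycles_per_word;
    long evict    = c->written ? c->size_words * cycles_per_word : 0;
    return bring_in + evict;
}

/* Bring the variable in only if the benefit exceeds the transfer cost. */
int worth_bringing_in(const Candidate *c, long dram_latency,
                      long sram_latency, long cycles_per_word)
{
    return transfer_benefit(c, dram_latency, sram_latency) >
           transfer_cost(c, cycles_per_word);
}
```

With a 10-cycle DRAM and a 1-cycle SRAM, a 100-word array that is written and accessed only 50 times in the region does not pay for its transfer, while the same array accessed 1000 times does.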
![Page 40: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/40.jpg)
What to Remove from SRAM?
• The data variables that are accessed furthest in the future.
• This needs a concept of time order of the different code regions.
• This time can be obtained by assigning timestamps to each of the nodes.
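The "furthest in the future" rule can be sketched as a selection over next-use timestamps. The representation is an assumption: `next_use[v]` holds the timestamp of the next region that accesses variable v, and the hypothetical sentinel `NEVER_USED_AGAIN` marks a variable with no further use.

```c
#include <stddef.h>

#define NEVER_USED_AGAIN (-1)   /* hypothetical sentinel */

/* Returns the index of the SRAM-resident variable to evict: a variable
 * that is never used again if there is one, otherwise the variable whose
 * next use has the largest timestamp. Returns `count` if there are none. */
size_t pick_victim(const int *next_use, size_t count)
{
    size_t victim = count;
    for (size_t v = 0; v < count; v++) {
        if (next_use[v] == NEVER_USED_AGAIN)
            return v;                   /* dead variable: ideal victim */
        if (victim == count || next_use[v] > next_use[victim])
            victim = v;
    }
    return victim;
}
```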
![Page 41: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/41.jpg)
The Data-Program Relationship Graph
• The DPRG is a new data structure that helps in identification of regions and marking of time stamps
• It is essentially a program’s call graph appended with additional nodes:
  • Loop nodes
  • Variable nodes
![Page 42: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/42.jpg)
Data-Program Relationship Graph
[Figure: DPRG with nodes main, Proc_A, Proc_B, Proc_C, two loop nodes and variable nodes a and b; nodes are labeled with timestamps 1 to 7]
• Defines regions
• Depth-first search order reveals execution time order
• “Allocation-change points” at region changes
![Page 43: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/43.jpg)
Time Stamps
• A method associates a time stamp with every program point
• The time stamps form a total order among themselves
• The program points are reached during runtime in time stamp order
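Timestamps in depth-first order can be assigned with a short recursive walk. This sketch assumes a tree-shaped graph in which each node is reached along a single path, so it stores one timestamp per node; nodes reachable along several paths would need one timestamp per path. The `Node` layout and names are hypothetical.

```c
#define MAX_CHILDREN 4

typedef struct {
    int children[MAX_CHILDREN];   /* indices of child nodes, in program order */
    int num_children;
    int timestamp;                /* filled in by assign_timestamps */
} Node;

/* Pre-order depth-first walk: a node is stamped before its children. */
static void visit(Node *g, int n, int *clock)
{
    g[n].timestamp = ++*clock;
    for (int k = 0; k < g[n].num_children; k++)
        visit(g, g[n].children[k], clock);
}

/* Stamps every node reachable from `root`, starting at 1. */
void assign_timestamps(Node *g, int root)
{
    int clock = 0;
    visit(g, root, &clock);
}
```

For a small graph in which main calls Proc_A (containing a loop) and then Proc_B, the walk stamps main = 1, Proc_A = 2, the loop = 3 and Proc_B = 4, which matches the order in which those program points are reached.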
![Page 44: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/44.jpg)
Optimizations
• There is no need to write back unmodified or dead SRAM variables into DRAM
• Optimize data transfer code using DMA when it is available
• Data transfer code can be placed in special memory block copy procedures
![Page 45: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/45.jpg)
Multiple Allocations due to Multiple Paths
• Contents of SRAM could be different on different incoming paths to a node in DPRG
• Problem can happen in
  • Loops
  • Conditional execution
  • Multiple calls to same procedure
![Page 46: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/46.jpg)
Conditional join nodes
• Favor the most frequent path
• Consensus allocation is chosen assuming the incoming allocation from the most probable predecessor
Join Node
![Page 47: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/47.jpg)
Procedure join nodes
• A few program points have multiple timestamps
• The nodes with multiple timestamps are called join nodes, as they join multiple paths from main()
• A strategy is used that adopts different allocation strategies for different paths but with the same code
![Page 48: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/48.jpg)
Offsets in SRAM
• SRAM can get fragmented when variables are swapped out
• Intelligent offset mechanism required
• In this method
  • Place memory variables with similar lifetimes together, which leaves larger fragments when they are evicted together
![Page 49: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/49.jpg)
Experimental Setup
• Architecture: Motorola MCORE
• Memory architecture: 2 levels of memory
• SRAM size: estimated as 25% of the total data requirement
• DRAM latency: 10 cycles
• Compiler: GCC
![Page 50: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/50.jpg)
Results
![Page 51: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/51.jpg)
Conclusion
The designer has to choose the right mix of scratch pad and cache for performance advantages.
![Page 52: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/52.jpg)
References
• Sumesh U, Rajeev B. Compiler Decided Dynamic Memory Allocation for Scratch Pad Based Embedded Systems.
• Alexandru N, Preeti P, N Dutt. Efficient Use of Scratch Pads in Embedded Applications.
• Josh Pfrimmer, Kin F. Li, and Daler Rakhmatov. Balancing Scratch Pad and Cache in Embedded Systems for Power and Speed Performance.
![Page 53: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/53.jpg)
Questions
![Page 54: Scratch Pad](https://reader036.fdocuments.us/reader036/viewer/2022070419/55cf9a6f550346d033a1b36d/html5/thumbnails/54.jpg)
Thank you