A Fast On-Chip Profiler Memory
Roman Lysecky, Susan Cotterell, Frank Vahid*
Department of Computer Science and Engineering
University of California, Riverside
*Also with the Center for Embedded Computer Systems, UC Irvine
This work was supported in part by the National Science Foundation
Outline
• Introduction
• Problem Definition
• Profiling Techniques
• Pipelined Binary Search Tree
• ProMem
• Conclusions
Introduction
Goal: Determine # of Times Each Target Pattern Appears on the Bus
Our Solution: Add On-Chip Profiler Memory to Monitored Bus
• Accepts 1 pattern/cycle
• Keeps Exact Counts
[Figure: system-on-chip with processor, I$, D$, memory, bridge, and four peripherals (Per.); ProMem monitors the embedded bus.]
Introduction
prog.c:
void compute() {
  // small Loop A
  for(i=0;…;…) …
  // small Loop B
  for(x=0;…;…) …
}
[Figure: profiling-driven partitioning flow. Profile information (instructions executed per loop) identifies where most instructions are executed; Loop A is moved to hardware through synthesis, and the on-chip FPGA is configured. System shown: processor, memory, three peripherals (Per.), and an FPGA.]
Introduction
• Profiling Can Be Used to Solve Many Problems
– Optimization of frequently executed subroutines
– Mapping frequently executed code and data to non-interfering cache regions
– Synthesis of optimized hardware for common cases
– Identifying frequent loops to map to a small low-power loop cache
– Many Others!
Problem Definition
• Objective
– Count number of times each target pattern appears on bus B
• Requirements
– Accept input patterns on every clock cycle
– Monitor any bus, e.g., deeply embedded buses in SOCs
– Non-intrusive
– Exact target pattern count
Input Patterns P = {p1, …, pm} (on bus B)
Target Patterns TP = {tp1, …, tpm}
Target Pattern Counts CTP = {ctp1, …, ctpm}
TP     CTP
tp1    11203
tp2    8876
…      …
tpm    ctpm
[Figure: processor, memory, and four peripherals (Per.) on bus B, which carries the input patterns p1, p2, …, pm.]
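The objective above can be restated as a short behavioral model. This is only a software sketch of what the profiler must compute, not the hardware itself; the function name `count_targets` and the sample bus trace are illustrative assumptions.

```python
def count_targets(bus_patterns, target_patterns):
    """Return the exact count of each target pattern seen on the bus."""
    counts = {tp: 0 for tp in target_patterns}
    for p in bus_patterns:          # one input pattern per clock cycle
        if p in counts:             # non-target patterns are ignored
            counts[p] += 1
    return counts

# Hypothetical bus trace (for example, addresses observed on bus B)
trace = [0x10, 0x14, 0x10, 0x20, 0x10, 0x14]
profile = count_targets(trace, [0x10, 0x14, 0x30])
```

Any hardware solution has to produce the same counts while keeping up with one pattern per cycle and without disturbing the system.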
Profiling Techniques - Software
• Instrumenting Software
– Adding code to count frequencies of desired code regions
• Problems
– Incurs runtime overhead
– Possibly changes program behavior
– Increase in code size
prog.c:
for( … ){
  …
  ctpm++;
}
[Figure: processor running the instrumented prog.c, with memory and four peripherals (Per.); input patterns p1, p2, …, pm on the bus.]
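A minimal sketch of instrumentation, written in Python for brevity. The region names and the `counts` dictionary are hypothetical; the point is that every counter update is extra code executed on each iteration.

```python
counts = {"loop_a": 0, "loop_b": 0}   # hypothetical counters, one per code region

def compute(n):
    total = 0
    for i in range(n):                # region: Loop A
        counts["loop_a"] += 1         # added instrumentation, runs every iteration
        total += i
    for x in range(n // 4):           # region: Loop B
        counts["loop_b"] += 1         # added instrumentation
        total -= x
    return total

result = compute(100)
```

The counts are exact, but the program is larger and slower than the uninstrumented version.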
Profiling Techniques - Software
• Periodic Sampling
– Interrupt processor at periodic interval
– Read program counter and other internal registers
• Problems
– Disruption of runtime behavior during interrupt
– Inaccurate
prog.c:
// ISR period = 10ms
ISR{
  //update profile info
}
[Figure: processor running prog.c, with memory and four peripherals (Per.); input patterns p1, p2, …, pm on the bus.]
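One way to see why sampling can be inaccurate is a toy trace in which the sampling period lines up with the program's own period. The trace and the period below are made-up values chosen to show the worst case.

```python
from collections import Counter

# Hypothetical trace: which region is executing at each time step.
# Region A runs for 9 steps, then region B for 1 step, repeated 10 times.
trace = (["A"] * 9 + ["B"]) * 10

exact = Counter(trace)                       # what an exact profiler reports

period = 10                                  # sample once every 10 steps
samples = Counter(trace[period - 1::period])
estimate = {region: n * period for region, n in samples.items()}
```

Here every sample lands on region B, so the sampled profile attributes all the time to B even though A accounts for 90% of the steps.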
Profiling Techniques - Software
• Simulation
– Execute application on instruction set simulator
– Simulator keeps track of profile information
• Problems
– Difficult to model external environment which leads to inaccuracy
– Extremely slow
[Figure: prog.c runs on an instruction set simulator (ISS), which outputs the profile information.]
Profiling Techniques - Hardware
• Logic Analyzer
– Probes placed directly on bus to be monitored
• Problems
– Cannot monitor embedded buses
[Figure: processor, memory, and four peripherals (Per.); input patterns p1, p2, …, pm on the bus.]
Profiling Techniques - Hardware
• Processor Support
– Mainly event counters
– Monitored events include cache misses, pipeline stalls, etc.
• Problems
– Few registers available
– Reconfiguration needed to obtain a complete profile
– Leads to inaccuracy
[Figure: processor, memory, and four peripherals (Per.); input patterns p1, p2, …, pm on the bus.]
Profiling Techniques - Hardware
• Content-addressable memories (CAMs)
– Fast search for a key in a large data set
– Returns the address at which the key resides in a memory
• Types
– Fully Associative
– RAM coupled with a smart controller
[Figure: a CAM attached to the bus alongside the processor, memory, and four peripherals (Per.); input patterns p1, p2, …, pm on the bus.]
Profiling Techniques - Hardware
• Fully Associative CAMs
– Simultaneously compares every location with the key
• Problems
– Does not scale well to larger memories
– Increased access time as CAM size grows
– Large Power Consumption
[Figure: the CAM compares the input pattern against every entry tp1, tp2, tp3, …, tpm at once, with one "=" comparator per entry.]
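A behavioral sketch of a fully associative lookup, assuming one comparator per stored entry as in the figure. The function name `cam_lookup` and the stored keys are illustrative; the loop only models the result of the parallel comparison, not its timing.

```python
def cam_lookup(entries, key):
    """Compare the key with every entry; return the matching address or None.

    In hardware all comparisons happen at the same time, one comparator
    per entry, so the comparator count equals the number of entries.
    """
    matches = [addr for addr, tp in enumerate(entries) if tp == key]
    return matches[0] if matches else None

entries = ["d", "a", "k", "f"]      # hypothetical contents, stored in any order
comparators = len(entries)          # hardware cost grows with the entry count
hit = cam_lookup(entries, "k")
miss = cam_lookup(entries, "z")
```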
Profiling Techniques - Hardware
• RAM coupled with a smart controller
– Efficient lookup data structure in memory such as a binary tree or Patricia Trie
• Problems
– Multiple cycle lookup
[Figure: CAM built from a controller (Ctrl) and an SRAM, attached to the bus alongside the processor, memory, and four peripherals (Per.).]
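The multiple-cycle cost can be illustrated with a controller that binary-searches a sorted SRAM. This is a sketch under the assumption of one memory read per cycle; `controller_lookup` and the sample contents are hypothetical.

```python
def controller_lookup(sram, key):
    """Binary search over a sorted SRAM; returns (address, cycles used)."""
    lo, hi, cycles = 0, len(sram) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        cycles += 1                      # one SRAM read per cycle (assumed)
        if sram[mid] == key:
            return mid, cycles
        if sram[mid] < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return None, cycles

sram = list("abcdefghijk")               # 11 sorted target patterns
best_case = controller_lookup(sram, "f")
typical = controller_lookup(sram, "h")
```

With 11 entries a lookup can take up to four cycles, so the controller cannot accept a new pattern on every cycle.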
Observations
• Not necessary to have a 1-cycle lookup
• Only need to accept one input pattern every cycle
Queueing
• Hold input patterns in queue until we are able to process them
• Problems
– Does not work with patterns arriving every clock cycle
[Figure: a FIFO placed between bus B and the CAM (Ctrl + SRAM).]
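A small simulation shows why a queue does not help when patterns arrive on every cycle. It assumes a single lookup unit that is busy for `lookup_cycles` cycles per pattern; the function name and parameters are illustrative.

```python
from collections import deque

def fifo_occupancy(cycles, lookup_cycles):
    """Peak FIFO depth when one pattern arrives every cycle."""
    fifo, busy, peak = deque(), 0, 0
    for t in range(cycles):
        fifo.append(t)                   # a new pattern arrives every cycle
        if busy == 0:
            fifo.popleft()               # start the next lookup
            busy = lookup_cycles
        busy -= 1
        peak = max(peak, len(fifo))
    return peak

single_cycle = fifo_occupancy(100, lookup_cycles=1)
four_cycle = fifo_occupancy(100, lookup_cycles=4)
four_cycle_longer = fifo_occupancy(200, lookup_cycles=4)
```

With a four-cycle lookup the queue depth keeps growing with the length of the run, so no finite FIFO is sufficient.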
Pipelining
• Implemented in processors such that instructions can be executed every cycle
• Can we use pipelining to solve our problem?
Pipelined CAM
• Large CAMs require long access times
• Partition large CAM into several smaller CAMs
– Requires pipelining to reduce access time
– Provides solution to access time problem
– Requires Large Area
– Large Power Consumption
[Figure: a chain of small CAMs separated by pipeline registers.]
Pipelined CAM
• Entries can be stored in a CAM in any order
– Requires sequential lookup in pipelined CAM approach
• Is there a benefit to sorting the entries?
– Not necessary to search all entries
– Leads to faster lookup time
• Tree structure provides an inherently sorted structure
– Search time remains a problem
– Can we pipeline the structure?
Pipelined Tree
• Solves access time problem
– One memory access per level
• Solves area problem
– Single comparator per level
– Each level grows by factor of two
– For large memories, comparators are negligible
[Figure: a tree with one "=" comparator per level.]
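The area argument can be checked with a little arithmetic: a full tree with one comparator per level holds 2^0 + 2^1 + … entries, while a fully associative CAM needs one comparator per entry. The helper name `tree_cost` is an assumption for illustration.

```python
def tree_cost(levels):
    """Entries and comparators for a full binary tree with `levels` levels."""
    entries = sum(2 ** s for s in range(levels))   # level s holds 2**s entries
    comparators = levels                           # a single comparator per level
    return entries, comparators

table = [tree_cost(n) for n in (1, 4, 12)]
```

For 12 levels this gives 4095 entries served by 12 comparators, which is why comparator area becomes negligible for large memories.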
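Pipelined Binary Search Tree
• Root Node
• Each node has at most two children
• Left child > Parent
• Right child < Parent
[Figure: example binary search tree over the keys a to k, drawn with larger keys to the left. Root: h. Level 1: j, d. Level 2: k, i, f, b. Level 3: g, e, c, a.]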
Pipelined Binary Search Tree
Searching for Input Pattern: f
• Stage 0: f < h, go right
• Stage 1: f > d, go left
• Stage 2: f = f, Found!
[Figure: the search path h, d, f through the example tree, with the tree levels labeled Stage 0 to Stage 3.]
Pipelined Binary Search Tree
Searching for Input Pattern: e
• Stage 0: e < h, append 0 to address (address: 0)
• Stage 1: e > d, append 1 to address (address: 01)
• Stage 2: e < f, append 0 to address (address: 010)
• Stage 3: e = e, Found!
[Figure: example tree with node addresses. Stage 1: j = 1, d = 0. Stage 2: k = 11, i = 10, f = 01, b = 00. Stage 3: g = 011, e = 010, c = 001, a = 000.]
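The address computation can be sketched in a few lines: each stage reads the node at the current address, compares, and appends one bit. The `LEVELS` table transcribes the example tree and its addresses; the function name `search` and the use of bit strings as addresses are assumptions of this sketch.

```python
# Per-stage memories for the example tree (address -> key); stage 0 is the root.
LEVELS = [
    {"": "h"},
    {"0": "d", "1": "j"},
    {"00": "b", "01": "f", "10": "i", "11": "k"},
    {"000": "a", "001": "c", "010": "e", "011": "g"},
]

def search(pattern):
    """Return (stage, address) where the pattern is found, or None."""
    addr = ""
    for stage, mem in enumerate(LEVELS):
        key = mem.get(addr)
        if key is None:
            return None                           # no node stored at this address
        if pattern == key:
            return stage, addr
        addr += "0" if pattern < key else "1"     # append one address bit per stage
    return None

found_e = search("e")
found_f = search("f")
```

Searching for e visits h, d, and f before matching at address 010 in Stage 3, as in the walkthrough above.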
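Pipelined Binary Search Tree
Searching for Input Pattern: e, f
• Stage 0: e < h, append 0 to address (address: 0)
• Stage 1: e > d, append 1 to address (address: 01) | Stage 0: f < h, append 0 to address (address: 0)
• Stage 2: e < f, append 0 to address (address: 010) | Stage 1: f > d, append 1 to address (address: 01)
• Stage 3: e = e, Found! | Stage 2: f = f, Found!
[Figure: the example tree with levels labeled Stage 0 to Stage 3; the search for f follows one stage behind the search for e.]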
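A cycle-by-cycle sketch of the overlapped searches, assuming one new pattern enters Stage 0 per cycle and every in-flight search moves down one stage per cycle. `LEVELS` transcribes the example tree; `run_pipeline` and its data layout are illustrative, not the actual hardware.

```python
# Per-stage memories for the example tree (address -> key); stage 0 is the root.
LEVELS = [
    {"": "h"},
    {"0": "d", "1": "j"},
    {"00": "b", "01": "f", "10": "i", "11": "k"},
    {"000": "a", "001": "c", "010": "e", "011": "g"},
]

def run_pipeline(patterns):
    """Return {pattern: cycle in which it was found}; cycles start at 0."""
    stages = [None] * len(LEVELS)          # each slot holds (pattern, address)
    pending = list(patterns)
    found, cycle = {}, 0
    while pending or any(stages):
        # Shift every in-flight search one stage down; accept one new pattern.
        stages = [(pending.pop(0), "") if pending else None] + stages[:-1]
        for s, slot in enumerate(stages):
            if slot is None:
                continue
            pattern, addr = slot
            key = LEVELS[s].get(addr)
            if key == pattern:
                found[pattern] = cycle
                stages[s] = None           # done: later stages stay idle
            elif key is None:
                stages[s] = None           # no node here: pattern is not a target
            else:
                stages[s] = (pattern, addr + ("0" if pattern < key else "1"))
        cycle += 1
    return found

result = run_pipeline(["e", "f"])
```

Pattern f enters one cycle after e but needs one stage fewer, so both are found in cycle 3 while each stage handles a different pattern.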
Pipelined Binary Search Tree
[Figure: each level of the example tree is stored in a standard memory, one per stage (Stage 0 to Stage 3), indexed by the node addresses (Stage 1: 1, 0; Stage 2: 11, 10, 01, 00; Stage 3: 011, 010, 001, 000); unused entries are shown as "-".]
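As a sketch of how the per-stage memories might be filled, the keys can be inserted one at a time using the same compare-and-append rule as the search. The insertion order below is an assumption chosen to reproduce the example tree; the slides do not describe how target patterns are loaded.

```python
def build_stage_memories(keys, stages):
    """Place keys into per-stage memories; stage s has 2**s words."""
    mems = [[None] * (2 ** s) for s in range(stages)]
    for key in keys:                         # insertion order fixes the tree shape
        addr = 0
        for s in range(stages):
            node = mems[s][addr]
            if node is None:
                mems[s][addr] = key          # free word: store the key here
                break
            if key == node:
                break                        # already stored
            addr = 2 * addr + (0 if key < node else 1)   # append one address bit
        else:
            raise ValueError(f"no room for {key!r} in {stages} stages")
    return mems

mems = build_stage_memories("hdjbfikaceg", stages=4)
```

The result places e at address 010 of Stage 3 and leaves the upper half of that memory unused.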
ProMem – Module Design
[Figure: ProMem stage s. Inputs: input pattern (ps_i), search address (As_i), enable (cen_i). Pipeline registers hold ps, As, and cen. Outputs to the next stage: input pattern (ps_o), search address (As+1_o), enable (cen_o).]
ProMem – Module Design
[Figure: ProMem stage s with the Target Pattern Memory TPMs (2^s × w) added; ports: addr, rd, dout.]
ProMem – Module Design
[Figure: ProMem stage s with a comparator (>, =) on the TPMs output. Callouts: Search for Target Pattern; Target Pattern Found; Target Pattern Not Found – Enable Next Stage.]
ProMem – Module Design
[Figure: ProMem stage s with the Target Pattern Count Memory CMs (2^s × c) added next to TPMs (2^s × w); CMs ports: wr, rd, addr, dout.]
ProMem – Module Design
[Figure: ProMem stage s with a +1 incrementer on the CMs output. Callout: When Target Pattern Found - Update Count Value.]
ProMem – Module Design
[Figure: the complete ProMem stage s, with its three parts labeled: Pipeline Register, Memories (TPMs and CMs), and Module Controller (comparator and +1 incrementer).]
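The module can be modeled behaviorally as follows. This is a sketch built from the signal names in the figures (pattern, search address, enable, TPM, CM); the class name, method name, and the way stages are chained are assumptions, and pipeline overlap between patterns is ignored.

```python
class ProMemStage:
    """Behavioral model of stage s: TPM and CM each have 2**s words."""

    def __init__(self, s, keys):
        self.tpm = list(keys) + [None] * (2 ** s - len(keys))  # target patterns
        self.cm = [0] * (2 ** s)                                # target pattern counts

    def step(self, pattern, addr, cen):
        """One clock: return (pattern, address, enable) for the next stage."""
        if not cen:
            return pattern, addr, False        # stage disabled this cycle
        target = self.tpm[addr]
        if target is None:
            return pattern, addr, False        # empty word: stop searching
        if pattern == target:
            self.cm[addr] += 1                 # found: update count value
            return pattern, addr, False        # later stages stay disabled
        bit = 1 if pattern > target else 0     # comparator result selects the child
        return pattern, 2 * addr + bit, True   # not found: enable next stage

# Example tree from the earlier slides, one stage per level.
stages = [ProMemStage(0, "h"), ProMemStage(1, "dj"),
          ProMemStage(2, "bfik"), ProMemStage(3, "aceg")]

def profile(patterns):
    for p in patterns:                         # functional view, one pattern at a time
        addr, cen = 0, True
        for stage in stages:
            p, addr, cen = stage.step(p, addr, cen)

profile("effxe")
```

After this run the count for e (Stage 3, address 010) and the count for f (Stage 2, address 01) are both 2, and the non-target pattern x changes nothing.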
ProMem - Interface
• Simple Interface
– Internal interface
  • Enable signal
  • Connection to monitored bus
– External interface
  • Read enable
  • Write enable
  • Connection to ProMem pattern input bus
[Figure: ProMem attached to the bus alongside the processor, memory, and four peripherals (Per.); signals: ren, wen, addr, cen, clk.]
ProMem - Layout
• Efficient Layout
– Achieved by simply abutting each module with the next
– Results in very short bus wires between each module
[Figure: the complete ProMem stage s module (pipeline registers, TPMs, CMs, comparator, +1 incrementer).]
ProMem Results – Area*
[Figure: two bar charts of size (gates) for Stage 0 through Stage 7, broken down into Memory, Pipeline Register, and Module Controller; one chart spans 0 to 120000 gates, the other 0 to 5000 gates.]
• Module overhead only 1%
*Area obtained using UMC .18 technology library provided by Artisan Components
ProMem Results – vs. CAM
[Figure: size (gates, 0 to 6000000) versus number of entries (1, 3, 7, …, 4095) for ProMem and CAM, shown as measured points and as trend lines.]
• CAM design is 46% larger than ProMem
ProMem Results – Timing vs. CAM
[Figure: access time (ns, 0 to 35) versus number of entries (0 to 70000) for ProMem and CAM.]
• CAM access time grows with CAM size
• ProMem access time remains constant (due to pipelining)
Conclusions
• Introduced a new memory structure specifically for fast on-chip profiling
• One pattern per cycle throughput
• Simple interface to monitored bus
• Efficient design is very scalable