A Fast On-Chip Profiler Memory Roman Lysecky, Susan Cotterell, Frank Vahid* Department of Computer...

A Fast On-Chip Profiler MemoryA Fast On-Chip Profiler Memory

Roman Lysecky, Susan Cotterell, Frank Vahid*Department of Computer Science and Engineering

University of California, Riverside*Also with the Center for Embedded Computer Systems, UC Irvine

This work was supported in part by the National Science Foundation

2

OutlineOutline

• Introduction

• Problem Definition

• Profiling Techniques

• Pipelined Binary Search Tree

• ProMem

• Conclusions

3

IntroductionIntroduction

MemProcessor

Per.Per.Per. Per.

I$

D$

Bridge

Monitor Embedded Bus

ProMem

Our Solution:Add On-Chip Profiler

Memory to Monitored Bus

• Accepts 1 pattern/cycle• Keeps Exact Counts

Goal: Determine # of Times Each Target Pattern

Appears on the Bus

Monitor Embedded Bus

4


MemProcessor

Per. Per. Per. FPGA

prog.c

void compute() { // small Loop A for(i=0;…;…)

…

// small Loop B for(x=0;…;…)}

…

Loop

N

Inst

ruct

ions

Loop

A …

Profile Information

Profile

Move Loop A to

HW

Synthesis

Configure FPGA

FPGA

Most Instruction

sExecuted

5


• Profiling Can Be Used to Solve Many Problems– Optimization of frequently executed

subroutines– Mapping frequently executed code and

data to non-interfering cache regions– Synthesis of optimized hardware for

common cases– Identifying frequent loops to map to a small

low-power loop cache– Many Others!

6

Problem DefinitionProblem Definition

• Objective– Count number of

times each target pattern appears on bus B

• Requirements– Accept input patterns

on every clock cycle– Monitoring any bus,

e.g., deeply embedded buses in SOCs

– Non-intrusive– Exact target pattern

count

TP CTPtp1 1120

3tp2 8876… …tpm ctpm

Target Patterns TP = {tpi, …,

tpm}

Target Pattern Counts CTP = {ctpi, …, ctpm}

MemProcessor

Per. Per. Per. Per.

p1 pm…p2

Bus BInput Patterns

P={pi , …, pm}

7

Profiling Techniques - SoftwareProfiling Techniques - Software

• Instrumenting Software– Adding code to

count frequencies of desired code regions

• Problems– Incurs runtime

overhead– Possibly changes

program behavior – Increase in code size

for( … ){

…

ctpm++;

}

MemProcessor

Per. Per. Per. Per.

p1

pm…p2prog.c

8


• Periodic Sampling– Interrupt processor

at periodic interval– Read program

counter and other internal registers

• Problems– Disruption of

runtime behavior during interrupt

– Inaccurate

// ISR period = 10msISR{

//update profile info

}

MemProcessor p

1pm…p

2prog.c

Per. Per. Per. Per.

9


• Simulation– Execute application on

instruction set simulator

– Simulator keeps track of profile information

• Problems– Difficult to model

external environment which leads to inaccuracy

– Extremely slow

ISS

prog.c

profile informatio

n

10

Profiling Techniques - HardwareProfiling Techniques - Hardware

• Logic Analyzer– Probes placed

directly on bus to be monitored

• Problems– Cannot monitor

embedded buses

MemProcessor p

1pm…p

2

Per. Per. Per. Per.

11


• Processor Support– Mainly event

counters– Monitored events

include cache misses, pipeline stalls, etc.

• Problems– Few registers

available– Reconfiguration

needed to obtain a complete profile

– Leads to inaccuracy

MemProcessor

p1 pm…p2

Per. Per. Per. Per.

12


• Content-addressable memories (CAMs)– Fast search for a

key in a large data set

– Returns the address at which the key resides in a memory

• Types– Fully Associative– RAM coupled with

a smart controller

MemProcessor

CAM

p1 pm…p2

Per. Per.

Per. Per.

13


• Fully Associative CAMs– Simultaneously

compares every location with the key

• Problems– Does not scale well

to larger memories– Increased access

time as CAM size grows

– Large Power Consumption

MemProcessor

CAM

p1 pm…p2

tp1

tpm

…

=

=

tp2 =

tp3 =

Per. Per.

Per. Per.

14


• RAM coupled with a smart controller– Efficient lookup

data structure in memory such as a binary tree or Patricia Trie

• Problems– Multiple cycle

lookup

Ctr

l

SRAM

MemProcessor

CAM

p1 pm…p2

Per. Per.

Per. Per.

15

ObservationsObservations

• Not necessary to have 1 cycle look up

• Only need to accept one input pattern every cycle

16

QueueingQueueing

• Hold input patterns in queue until we are able to process them

• Problems– Does not work with

patterns arriving every clock cycle

Ctr

l

SRAM

CAM

Bus B

FIFO

17

PipeliningPipelining

• Implemented in processors such that instructions can be executed every cycle

• Can we use pipelining to solve our problem?

18

Pipelined CAMPipelined CAM

• Large CAMs required long access times

• Partition large CAM into several smaller CAMs– Requires pipelining to reduce

access time– Provides solution to access

time problem– Requires Large Area– Large Power Consumption

CAM

Pi p

eli n

e

Reg

Pi p

eli n

e

Reg

Pi p

eli n

e

Reg

Pi p

eli n

e

Reg CA

MCAM

CAM

CAM

19

Pipelined CAMPipelined CAM

• Entries can be stored in a CAM in any order– requires sequential lookup in pipelined CAM

approach

• Is there a benefit to sorting the entries?– not necessary to search all entries– leads to faster lookup time

• Tree structure provides a inherently sorted structure– Search time remains a problem– Can we pipeline the structure?

20

Pipelined TreePipelined Tree

• Solves access time problem– One memory access per

level

• Solves area problem– Single comparator per

level– Each level grows by

factor of two– For large memories,

comparators are negligible

=

=

=

=

21

Pipelined Binary Search TreePipelined Binary Search Tree

Root Node

Each node has at most two children

Left child > Parent

Right child < Parent

ace

j d

bfik

g

h

22


Searching for Input Pattern: f

ace

j d

bfik

g

hh

f > d, go leftd

f < h, go righth

f = f, Found!f

d

hSta

ge

0Sta

ge

1Sta

ge

2Sta

ge

3

23


ace

j d

bfik

g

h

001

011

000

11

00

10

1 0

01

010

e = e, Found!

f

01

d

h

0

e

010

e < f, append 0 to addressf

01

d

h

0

010

e > d, append 1 to address

01

d

h

0

e < h, append 0 to addressh

0

Sta

ge

0Sta

ge

1Sta

ge

2Sta

ge

3

Searching for Input Pattern: e

24


ace

j d

bfik

g

hSta

ge

0Sta

ge

1Sta

ge

2Sta

ge

3

Searching for Input Pattern: e, f

e < f, append 0 to addressf

010

df > d, append 1 to address

01

e < h, append 0 to addressh

0

e = e, Found!e

ff = f, Found!

e > d, append 1 to address

01

d

f < h, append 0 to addressh

0

25


Sta

ge

0Sta

ge

1Sta

ge

2Sta

ge

3

- - - -- - - -

001

011

000

11

00

10

1 0

01

010

001

011

000

010

ace

j d

bfik

g

h

StandardMemories

26

ProMem – Module DesignProMem – Module DesignInput Pattern Search

AddressEnable

Search Address

(Next Stage)

Enable (Next Stage)

> ps > As > cen

cen_ops_o As+1_o

ps_i As_i cen_i

Pipeline regs

ProMem stage s

Input Pattern

27

ProMem – Module DesignProMem – Module Design

> ps > As > cen

cen_ops_o As+1_o

ps_i As_i cen_i

Pipeline regs

ProMem stage s

Target Pattern Memory TPMs

(2s×w)

addr rd

dout

28

> ps > As > cen

cen_ops_o As+1_o

ps_i As_i cen_i

Pipeline regs

ProMem stage s


TPMs

(2s×w)

addr rd

dout

Target Pattern Not Found – Enable

Next Stage

Target Pattern FoundSearch for

Target Pattern Compare> =

29

> ps > As > cen

cen_ops_o As+1_o

ps_i As_i cen_i

Pipeline regs

ProMem stage s


Compare> =

TPMs

(2s×w)

addr rd

dout

Target Pattern Count Memory

CMs

(2s×c)

wr

rd addr

dout

30

> ps > As > cen

cen_ops_o As+1_o

ps_i As_i cen_i

Pipeline regs

ProMem stage s


Compare> =

TPMs

(2s×w)

addr rd

dout

CMs

(2s×c)

wr

rd addr

dout

When Target Pattern Found - Update Count

Value

+1

1

31

> ps > As > cen

cen_ops_o As+1_o

ps_i As_i cen_i

Pipeline regs

ProMem stage s


Compare> =

TPMs

(2s×w)

addr rd

dout

CMs

(2s×c)

wr

rd addr

dout

+1

1

Pipeline Register

Memories

ModuleController

Module Controller

32

ProMem - InterfaceProMem - Interface

• Simple Interface – Internal interface

• Enable signal• Connection to

monitored bus

– External interface• Read enable• Write enable• Connection to

ProMem pattern input bus

MemProcessor

ProMem

p1 pm…p2

renwen

addrcen

clk

Per. Per.

Per. Per.

33

ProMem - LayoutProMem - Layout

• Efficient Layout – Achieved by simply

abutting each module with the next

– Results in very short bus wires between each module

> ps > As > cen

cen_ops_o As+1_o

ps_i As_i cen_i

Pipeline regs

ProMem stage s

Compare> =

TPMs

(2s×w)

addr rd

dout

CMs

(2s×c)

wr

rd addr

dout

+1

1

34

ProMem Results – Area*ProMem Results – Area*

0

20000

40000

60000

80000

100000

120000St

age

0

Stag

e 1

Stag

e 2

Stag

e 3

Stag

e 4

Stag

e 5

Stag

e 6

Stag

e 7

Size

(G

ates

)

0

1000

2000

3000

4000

5000

Stag

e 0

Stag

e 1

Stag

e 2

Stag

e 3

Stag

e 4

Stag

e 5

Stag

e 6

Stag

e 7

Memory

PipelineRegister

ModuleController

Module overhead only

1%

*Area obtained using UMC .18 technology library provided by Artisan Components

35

ProMem Results – vs. CAMProMem Results – vs. CAM

0

1000000

2000000

3000000

4000000

5000000

6000000

1 3 7 15 31 63 127

255

511

1023

2047

4095

Number of Entries

Size

(ga

tes)

ProMem

CAM

0

1000000

2000000

3000000

4000000

5000000

6000000

1 3 7 15 31 63 127

255

511

1023

2047

4095

Number of Entries

Size

(ga

tes)

ProMem (Trend)

CAM (Trend)

CAM design is 46% larger

than ProMem

36

ProMem Results – Timing vs. CAMProMem Results – Timing vs. CAM

0

5

10

15

20

25

30

35

0

1000

0

2000

0

3000

0

4000

0

5000

0

6000

0

7000

0Number of Entries

Acc

ess

Tim

e (n

s)

ProMem

CAM

CAM access time grows

with CAM size

ProMem access time remains

constant (Due to

Pipelining)

37

ConclusionsConclusions

• Introduced a new memory structure specifically for fast on-chip profiling

• One pattern per cycle throughput

• Simple interface to monitored bus

• Efficient design is very scalable

A Fast On-Chip Profiler Memory Roman Lysecky, Susan Cotterell, Frank Vahid* Department of Computer...

Documents

Transcript of A Fast On-Chip Profiler Memory Roman Lysecky, Susan Cotterell, Frank Vahid* Department of Computer...