A Fast On-Chip Profiler Memory Roman Lysecky, Susan Cotterell, Frank Vahid* Department of Computer Science and Engineering University of California, Riverside *Also with the Center for Embedded Computer Systems, UC Irvine This work was supported in part by the National Science Foundation


Page 1: A Fast On-Chip Profiler Memory

Roman Lysecky, Susan Cotterell, Frank Vahid*
Department of Computer Science and Engineering
University of California, Riverside
*Also with the Center for Embedded Computer Systems, UC Irvine

This work was supported in part by the National Science Foundation

Page 2: Outline

• Introduction
• Problem Definition
• Profiling Techniques
• Pipelined Binary Search Tree
• ProMem
• Conclusions

Page 3: Introduction

Goal: Determine the number of times each target pattern appears on the bus.

Our solution: Add an on-chip profiler memory to the monitored bus.
• Accepts 1 pattern/cycle
• Keeps exact counts

[Figure: system-on-chip with a processor, memory (Mem), I$, D$, a bridge, and four peripherals (Per.); ProMem monitors the embedded bus.]

Page 4: Introduction

prog.c:

void compute() {
    // small Loop A
    for (i = 0; …; …)
    // small Loop B
    for (x = 0; …; …)
}

Flow shown on the slide:
1. Profile prog.c to obtain profile information (number of instructions per loop).
2. Loop A accounts for the most instructions executed.
3. Move Loop A to HW.
4. Synthesis.
5. Configure FPGA.

[Figure: system with a processor, memory (Mem), three peripherals (Per.), and an FPGA.]

Page 5: Introduction

• Profiling Can Be Used to Solve Many Problems
– Optimization of frequently executed subroutines
– Mapping frequently executed code and data to non-interfering cache regions
– Synthesis of optimized hardware for common cases
– Identifying frequent loops to map to a small low-power loop cache
– Many others!

Page 6: Problem Definition

• Objective
– Count number of times each target pattern appears on bus B
• Requirements
– Accept input patterns on every clock cycle
– Monitoring any bus, e.g., deeply embedded buses in SOCs
– Non-intrusive
– Exact target pattern count

Input Patterns on Bus B: P = {p1, …, pm}
Target Patterns: TP = {tp1, …, tpm}
Target Pattern Counts: CTP = {ctp1, …, ctpm}

TP     CTP
tp1    11203
tp2    8876
…      …
tpm    ctpm

[Figure: system with a processor, memory (Mem), and four peripherals (Per.); input patterns p1, p2, …, pm appear on bus B.]
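The objective above can be written down as a small software reference model. This is only a sketch of the specification, not of the hardware: the function name `profile_exact`, the type `pattern_t`, and the 32-bit pattern width are assumptions made for illustration, since the slides do not fix a bus width.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t pattern_t;  /* assumed bus width; not given on the slides */

/* For every input pattern p[j] seen on the bus, increment the count of the
 * matching target pattern tp[i], if there is one. ctp[] must have m entries. */
void profile_exact(const pattern_t *tp, uint64_t *ctp, size_t m,
                   const pattern_t *p, size_t n)
{
    assert(tp != NULL && ctp != NULL && (p != NULL || n == 0));

    for (size_t i = 0; i < m; i++)
        ctp[i] = 0;

    for (size_t j = 0; j < n; j++) {
        for (size_t i = 0; i < m; i++) {
            if (p[j] == tp[i]) {
                ctp[i]++;   /* exact count, never an estimate */
                break;      /* target patterns are assumed distinct */
            }
        }
    }
}
```

Any profiler that meets the "exact target pattern count" requirement should produce the same `ctp[]` values as this model; the difficulty addressed in the rest of the talk is doing so at one input pattern per clock cycle.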

Page 7: Profiling Techniques - Software

• Instrumenting Software
– Adding code to count frequencies of desired code regions
• Problems
– Incurs runtime overhead
– Possibly changes program behavior
– Increase in code size

prog.c:

for ( … ) {
    ctpm++;
}

[Figure: system with a processor, memory (Mem), and four peripherals (Per.); patterns p1, p2, …, pm appear on the bus.]

Page 8: Profiling Techniques - Software

• Periodic Sampling
– Interrupt processor at periodic interval
– Read program counter and other internal registers
• Problems
– Disruption of runtime behavior during interrupt
– Inaccurate

prog.c:

// ISR period = 10ms
ISR {
    // update profile info
}

[Figure: system with a processor, memory (Mem), and four peripherals (Per.); patterns p1, p2, …, pm appear on the bus.]
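The inaccuracy of periodic sampling can be illustrated with a short sketch. It assumes the program's behavior is available as a trace of code-region identifiers, one per cycle; `NUM_REGIONS`, `count_exact`, and `count_sampled` are hypothetical names, not part of the original slides.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_REGIONS 4  /* hypothetical number of profiled code regions */

/* Exact profile: count every entry of the trace. */
void count_exact(const int *trace, size_t n, unsigned long *counts)
{
    for (int r = 0; r < NUM_REGIONS; r++)
        counts[r] = 0;
    for (size_t t = 0; t < n; t++) {
        assert(trace[t] >= 0 && trace[t] < NUM_REGIONS);
        counts[trace[t]]++;
    }
}

/* Sampled profile: look at the trace only once every `period` cycles,
 * as a periodic interrupt would. */
void count_sampled(const int *trace, size_t n, size_t period,
                   unsigned long *counts)
{
    assert(period > 0);
    for (int r = 0; r < NUM_REGIONS; r++)
        counts[r] = 0;
    for (size_t t = 0; t < n; t += period) {
        assert(trace[t] >= 0 && trace[t] < NUM_REGIONS);
        counts[trace[t]]++;
    }
}
```

For the trace `{0, 1, 0, 1, 0, 1, 0, 2}` sampled with a period of 2, every sample lands on region 0, so regions 1 and 2 are never observed even though together they account for half of the trace.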

Page 9: Profiling Techniques - Software

• Simulation
– Execute application on instruction set simulator
– Simulator keeps track of profile information
• Problems
– Difficult to model external environment, which leads to inaccuracy
– Extremely slow

[Figure: prog.c runs on an instruction set simulator (ISS), which produces profile information.]

Page 10: Profiling Techniques - Hardware

• Logic Analyzer
– Probes placed directly on bus to be monitored
• Problems
– Cannot monitor embedded buses

[Figure: system with a processor, memory (Mem), and four peripherals (Per.); patterns p1, p2, …, pm appear on the bus.]

Page 11: Profiling Techniques - Hardware

• Processor Support
– Mainly event counters
– Monitored events include cache misses, pipeline stalls, etc.
• Problems
– Few registers available
– Reconfiguration needed to obtain a complete profile
– Leads to inaccuracy

[Figure: system with a processor, memory (Mem), and four peripherals (Per.); patterns p1, p2, …, pm appear on the bus.]

Page 12: Profiling Techniques - Hardware

• Content-addressable memories (CAMs)
– Fast search for a key in a large data set
– Returns the address at which the key resides in a memory
• Types
– Fully associative
– RAM coupled with a smart controller

[Figure: system with a processor, memory (Mem), and four peripherals (Per.); a CAM observes patterns p1, p2, …, pm on the bus.]

Page 13: Profiling Techniques - Hardware

• Fully Associative CAMs
– Simultaneously compares every location with the key
• Problems
– Does not scale well to larger memories
– Increased access time as CAM size grows
– Large power consumption

[Figure: a CAM on the bus compares each input pattern against tp1, tp2, tp3, …, tpm at the same time, with one comparator (=) per entry.]

Page 14: Profiling Techniques - Hardware

• RAM coupled with a smart controller
– Efficient lookup data structure in memory such as a binary tree or Patricia Trie
• Problems
– Multiple cycle lookup

[Figure: a CAM on the bus built from a controller (Ctrl) and an SRAM.]
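A rough way to see the "multiple cycle lookup" problem is to count memory reads during a lookup. The sketch below uses a plain binary search over a sorted table as a stand-in for the binary tree or Patricia trie named on the slide, and treats one RAM read as one cycle; both simplifications, and the name `ram_lookup`, are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Binary search over a sorted table held in RAM. Each loop iteration
 * performs one memory read, which this sketch treats as one cycle.
 * Returns the index of `key`, or -1 if absent; *reads receives the
 * number of memory reads that were needed. */
long ram_lookup(const uint32_t *sorted, size_t n, uint32_t key,
                unsigned *reads)
{
    assert(reads != NULL);
    size_t lo = 0, hi = n;            /* search window is [lo, hi) */
    *reads = 0;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        uint32_t v = sorted[mid];     /* one RAM access */
        (*reads)++;
        if (v == key)
            return (long)mid;
        if (v < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return -1;
}
```

With a 7-entry table, a lookup can need up to 3 reads, so a controller that handles one lookup at a time cannot keep up with a new pattern arriving on every clock cycle.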

Page 15: Observations

• Not necessary to have 1 cycle look up
• Only need to accept one input pattern every cycle

Page 16: Queueing

• Hold input patterns in queue until we are able to process them
• Problems
– Does not work with patterns arriving every clock cycle

[Figure: a FIFO sits between bus B and the CAM (controller plus SRAM).]
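Why queueing fails when a pattern arrives on every clock cycle can be checked with a small simulation. It is a sketch under stated assumptions: exactly one arrival per cycle, a fixed lookup latency, and a FIFO of finite depth; `fifo_dropped` and its parameters are hypothetical names.

```c
#include <assert.h>

/* Simulate `cycles` clock cycles. One pattern arrives every cycle; the
 * lookup unit needs `lookup_cycles` cycles per pattern; the FIFO holds at
 * most `depth` waiting patterns. Returns the number of patterns dropped
 * because the FIFO was full. */
unsigned long fifo_dropped(unsigned long cycles, unsigned lookup_cycles,
                           unsigned depth)
{
    assert(lookup_cycles >= 1);
    unsigned long dropped = 0;
    unsigned occupancy = 0;   /* patterns waiting in the FIFO */
    unsigned busy = 0;        /* cycles left on the current lookup */

    for (unsigned long t = 0; t < cycles; t++) {
        /* A new pattern arrives on the bus this cycle. */
        if (occupancy < depth)
            occupancy++;
        else
            dropped++;

        /* Start a new lookup if the unit is idle and work is queued. */
        if (busy == 0 && occupancy > 0) {
            occupancy--;
            busy = lookup_cycles;
        }
        if (busy > 0)
            busy--;
    }
    return dropped;
}
```

With a one-cycle lookup nothing is lost, but as soon as a lookup takes several cycles the FIFO fills at a steady rate and any finite depth eventually overflows; a deeper FIFO only delays the first dropped pattern.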

Page 17: Pipelining

• Implemented in processors such that instructions can be executed every cycle
• Can we use pipelining to solve our problem?

Page 18: Pipelined CAM

• Large CAMs require long access times
• Partition large CAM into several smaller CAMs
– Requires pipelining to reduce access time
– Provides solution to access time problem
– Requires large area
– Large power consumption

[Figure: a large CAM partitioned into smaller CAMs separated by pipeline registers.]

Page 19: Pipelined CAM

• Entries can be stored in a CAM in any order
– Requires sequential lookup in pipelined CAM approach
• Is there a benefit to sorting the entries?
– Not necessary to search all entries
– Leads to faster lookup time
• Tree structure provides an inherently sorted structure
– Search time remains a problem
– Can we pipeline the structure?

Page 20: Pipelined Tree

• Solves access time problem
– One memory access per level
• Solves area problem
– Single comparator per level
– Each level grows by factor of two
– For large memories, comparators are negligible

[Figure: tree levels, each with a single comparator (=).]
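The area argument rests on simple arithmetic: if each level doubles in size, level s holds 2^s entries and a tree with n levels holds 2^n − 1 entries in total, while needing only n comparators. The helper names below are hypothetical and the functions only restate that arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Entries stored at level s of a full binary tree: 2^s. */
uint64_t level_entries(unsigned s)
{
    assert(s < 63);
    return (uint64_t)1 << s;
}

/* Total entries in a full binary tree with `levels` levels: 2^levels - 1.
 * With one comparator per level, the comparator count equals `levels`. */
uint64_t total_entries(unsigned levels)
{
    assert(levels < 63);
    return ((uint64_t)1 << levels) - 1;
}
```

For example, 12 levels hold 4095 entries with 12 comparators, whereas a fully associative CAM of the same capacity compares all 4095 entries at once.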

Page 21: Pipelined Binary Search Tree

• Root node
• Each node has at most two children
• Left child > Parent
• Right child < Parent

[Figure: binary search tree over the patterns a to k. The root is h; h has children d and j; d has children b and f; j has children i and k; b has children a and c; f has children e and g.]

Page 22: Pipelined Binary Search Tree

Searching for Input Pattern: f

Stage 0: f < h, go right
Stage 1: f > d, go left
Stage 2: f = f, Found!

[Figure: the example tree arranged by pipeline stage. Stage 0: h. Stage 1: d, j. Stage 2: b, f, i, k. Stage 3: a, c, e, g.]

Page 23: Pipelined Binary Search Tree

Searching for Input Pattern: e

Stage 0: e < h, append 0 to address (address 0)
Stage 1: e > d, append 1 to address (address 01)
Stage 2: e < f, append 0 to address (address 010)
Stage 3: e = e, Found!

[Figure: the example tree arranged by pipeline stage, with each node labeled by its address within the stage. Stage 0: h. Stage 1: d = 0, j = 1. Stage 2: b = 00, f = 01, i = 10, k = 11. Stage 3: a = 000, c = 001, e = 010, g = 011.]
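The address-append rule can be sketched in software using the example tree from the slides. The per-stage arrays, the `EMPTY` marker for unused entries, and the function `bst_search` are assumptions of this sketch; the slides show only the tree and the addresses.

```c
#include <assert.h>
#include <stddef.h>

#define STAGES 4
#define EMPTY  '\0'   /* marks an unused entry (assumption of this sketch) */

/* One array per stage; stage s has 2^s entries, indexed by search address.
 * Contents follow the example tree on the slides. */
static const char stage0[1] = { 'h' };
static const char stage1[2] = { 'd', 'j' };
static const char stage2[4] = { 'b', 'f', 'i', 'k' };
static const char stage3[8] = { 'a', 'c', 'e', 'g',
                                EMPTY, EMPTY, EMPTY, EMPTY };
static const char *const tree[STAGES] = { stage0, stage1, stage2, stage3 };

/* Returns the stage at which `p` is found, or -1 if it is not in the tree.
 * On success, *addr holds the entry's address within that stage. */
int bst_search(char p, unsigned *addr)
{
    assert(addr != NULL);
    unsigned a = 0;                     /* stage 0 has a single entry */

    for (int s = 0; s < STAGES; s++) {
        char node = tree[s][a];
        if (node == EMPTY)
            return -1;
        if (p == node) {
            *addr = a;
            return s;
        }
        /* p < node: append 0 to the address; p > node: append 1. */
        a = (a << 1) | (p > node ? 1u : 0u);
    }
    return -1;
}
```

Searching for `'e'` visits h, d, and f and ends at stage 3 with address 2 (binary 010), matching the walk-through on the slide. Because the address for stage s+1 is just the stage-s address with one bit appended, no child pointers need to be stored.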

Page 24: Pipelined Binary Search Tree

Searching for Input Pattern: e, f

Step 1: Stage 0: e < h, append 0 to address (address 0)
Step 2: Stage 0: f < h, append 0 to address (address 0); Stage 1: e > d, append 1 to address (address 01)
Step 3: Stage 1: f > d, append 1 to address (address 01); Stage 2: e < f, append 0 to address (address 010)
Step 4: Stage 2: f = f, Found!; Stage 3: e = e, Found!

[Figure: the example tree arranged by pipeline stage, Stage 0 through Stage 3, with e and f in flight in adjacent stages.]
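The overlapped searches above can be imitated with a cycle-by-cycle software model. This is a sketch under several assumptions, not the ProMem implementation: patterns are single characters, a zero byte marks an unused entry, each stage has 8 array slots for simplicity, and the names `promem_model`, `promem_init`, and `promem_clock` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STAGES 4

typedef struct {
    char     p;     /* input pattern held by this stage */
    unsigned a;     /* search address for this stage */
    int      cen;   /* enable: 1 if the stage holds a live search */
} stage_reg;

typedef struct {
    char          tpm[STAGES][8];  /* target patterns; stage s uses 2^s */
    unsigned long cm[STAGES][8];   /* one count per target pattern */
    stage_reg     reg[STAGES];     /* pipeline registers */
} promem_model;

/* Load the example tree from the slides and clear counts and registers. */
void promem_init(promem_model *m)
{
    static const char *levels[STAGES] = { "h", "dj", "bfik", "aceg" };
    assert(m != NULL);
    memset(m, 0, sizeof *m);
    for (int s = 0; s < STAGES; s++)
        for (size_t i = 0; levels[s][i] != '\0'; i++)
            m->tpm[s][i] = levels[s][i];
}

/* Advance one clock cycle. If `valid` is nonzero, pattern `p` is on the
 * bus this cycle and is latched into stage 0. */
void promem_clock(promem_model *m, int valid, char p)
{
    stage_reg next[STAGES];
    assert(m != NULL);

    next[0].p = p;
    next[0].a = 0;
    next[0].cen = valid;

    for (int s = 0; s < STAGES; s++) {
        stage_reg r = m->reg[s];
        int forward = 0;
        unsigned a_next = 0;

        if (r.cen) {
            char node = m->tpm[s][r.a];
            if (node == r.p) {
                m->cm[s][r.a]++;          /* found: update the count */
            } else if (node != 0) {
                forward = 1;              /* not found: enable next stage */
                a_next = (r.a << 1) | (r.p > node ? 1u : 0u);
            }
        }
        if (s + 1 < STAGES) {
            next[s + 1].p = r.p;
            next[s + 1].a = a_next;
            next[s + 1].cen = forward;
        }
    }
    memcpy(m->reg, next, sizeof next);
}
```

A caller feeds one pattern per call and then clocks `STAGES` more times with `valid` set to 0 to drain the pipeline. Each stage touches only its own memory in a given cycle, which is what lets a new pattern enter stage 0 while earlier patterns are still being compared further down.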

Page 25: Pipelined Binary Search Tree

Standard Memories

[Figure: the nodes of Stage 0 through Stage 3 of the example tree are held in standard memories, one per stage, at the addresses shown on the earlier slides (0 and 1; 00 to 11; 000 to 011); unused entries are shown as "-".]

Page 26: ProMem – Module Design

[Figure: ProMem stage s. Pipeline registers hold the input pattern (ps), the search address (As), and the enable (cen). Inputs: ps_i (input pattern), As_i (search address), cen_i (enable). Outputs: ps_o (input pattern), As+1_o (search address for the next stage), cen_o (enable for the next stage).]

Page 27: ProMem – Module Design

[Figure: ProMem stage s with its pipeline registers (ps, As, cen), now including the Target Pattern Memory TPMs (2^s × w) with ports addr, rd, and dout.]

Page 28: ProMem – Module Design

• Search for target pattern in TPMs (2^s × w)
• Compare (>, =) the input pattern with the target pattern read from TPMs
– Target pattern found
– Target pattern not found: enable next stage

[Figure: ProMem stage s with pipeline registers (ps, As, cen), TPMs (ports addr, rd, dout), and the compare logic.]

Page 29: ProMem – Module Design

[Figure: ProMem stage s with pipeline registers (ps, As, cen), TPMs (2^s × w; ports addr, rd, dout), and compare logic (>, =), now including the Target Pattern Count Memory CMs (2^s × c) with ports wr, rd, addr, and dout.]

Page 30: ProMem – Module Design

• When target pattern found, update count value (+1)

[Figure: ProMem stage s with pipeline registers (ps, As, cen), TPMs (2^s × w), compare logic (>, =), and CMs (2^s × c), now including a +1 incrementer on the count memory output.]

Page 31: ProMem – Module Design

The module consists of three parts:
• Pipeline Register
• Memories: TPMs (2^s × w) and CMs (2^s × c)
• Module Controller: compare logic (>, =) and the +1 count update

[Figure: the complete ProMem stage s with these three parts labeled.]

Page 32: ProMem - Interface

• Simple Interface
– Internal interface
  • Enable signal
  • Connection to monitored bus
– External interface
  • Read enable
  • Write enable
  • Connection to ProMem pattern input bus

[Figure: ProMem in a system with a processor, memory (Mem), and four peripherals (Per.); it observes patterns p1, p2, …, pm on the bus and has the signals ren, wen, addr, cen, and clk.]

Page 33: ProMem - Layout

• Efficient Layout
– Achieved by simply abutting each module with the next
– Results in very short bus wires between each module

[Figure: ProMem stage s module with pipeline registers (ps, As, cen), compare logic (>, =), TPMs (2^s × w), CMs (2^s × c), and the +1 count update.]

Page 34: ProMem Results – Area*

[Figure: two bar charts of size (gates) for Stage 0 through Stage 7, broken down into Memory, Pipeline Register, and Module Controller.]

Module overhead only 1%

*Area obtained using UMC .18 technology library provided by Artisan Components

Page 35: ProMem Results – vs. CAM

[Figure: size (gates) versus number of entries (1 to 4095) for ProMem and CAM, shown once as measured values and once as trend lines.]

CAM design is 46% larger than ProMem

Page 36: ProMem Results – Timing vs. CAM

[Figure: access time (ns) versus number of entries for ProMem and CAM.]

CAM access time grows with CAM size
ProMem access time remains constant (due to pipelining)

Page 37: Conclusions

• Introduced a new memory structure specifically for fast on-chip profiling
• One pattern per cycle throughput
• Simple interface to monitored bus
• Efficient design is very scalable