Lecture 20: Exam 2 Review November 21, 2013 ECE 636 Reconfigurable Computing Lecture 20 Exam 2 Review

Lecture 20: Exam 2 Review November 21, 2013

ECE 636

Reconfigurable Computing

Lecture 20

Exam 2 Review

PRISC

° Architecture:

• Couple into register file as “superscalar” functional unit

• flow-through array (no state)

PRISC Results

° All compiled

° working from MIPS binary

° <200 4LUTs ?

• 64x3

° 200MHz MIPS base

Razdan/Micro27

Chimaera

• Start from PRISC idea.

- Integrate as a functional unit

- No state

- RFU Ops (like expfu)

- Stall processor on instruction miss

• Add

- Multiple instructions at a time

- More than 2 inputs possible

• Hauck: University of Washington

Chimaera Architecture

• Live copy of register file values feed into array

• Each row of array may compute from register values or intermediates

• Tag on array to indicate RFUOP

Chimaera Architecture

• Array can operate on values as soon as placed in register file.

• Logic is combinational

• When RFUOP matches

- Stall until result ready

- Drive result from matching row
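
The tag-matching behavior described above can be sketched as a small behavioral model. This is only an illustration under stated assumptions: the class `ChimaeraRow`, the function `execute_rfuop`, and the example row functions are hypothetical names invented here, not part of the Chimaera design or any tool.

```python
from typing import Callable, List, Optional


class ChimaeraRow:
    """One array row: a tag (RFUOP ID) plus a combinational function.

    Hypothetical model: the row reads the live copy of the register
    file and holds no state of its own.
    """

    def __init__(self, tag: int, func: Callable[[List[int]], int]):
        self.tag = tag
        self.func = func


def execute_rfuop(rows: List[ChimaeraRow],
                  shadow_regs: List[int],
                  rfuop_id: int) -> Optional[int]:
    """Return the result of the row whose tag matches rfuop_id.

    Returns None on a miss, which stands in for the case where the
    processor stalls while the configuration is loaded.
    """
    for row in rows:
        if row.tag == rfuop_id:
            # Drive the result from the matching row.
            return row.func(shadow_regs)
    return None


# Two example rows; the second uses more than two register inputs.
rows = [
    ChimaeraRow(1, lambda r: (r[1] + r[2]) & 0xFFFFFFFF),
    ChimaeraRow(2, lambda r: r[1] & r[2] & r[3]),
]
regs = [0, 5, 3, 1]
hit = execute_rfuop(rows, regs, 1)
miss = execute_rfuop(rows, regs, 9)
```

Because each row is a pure function of the register values, the model has no state to save or restore, which matches the "no state" property listed for PRISC and Chimaera.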

Chimaera Results

• Three Spec92 benchmarks

- Compress 1.11 speedup

- Eqntott 1.8

- Life 2.06

• Small arrays with limited state

• Small speedup

• Perhaps focus on global router rather than local optimization.

Garp

• Integrate as coprocessor

- Similar bandwidth to processor as functional unit

- Own access to memory

• Support multi-cycle operation

- Allow state

- Cycle counter to track operation

• Configuration cache, path to memory

Garp – UC Berkeley

• ISA – coprocessor operations

- Issue gaconfig to make particular configuration present.

- Explicitly move data to/from array

- Processor suspension during coproc operation

- Use cycle counter to track progress

• Array may directly access memory

- Processor and array share memory

- Exploits streaming data operations

- Cache/MMU maintains data consistency

Garp Instructions

• Interlock indicates if processor waits for array to count to zero.

• Last three instructions useful for context swap

• Processor decode hardware augmented to recognize new instructions.

Garp Array

• Row-oriented logic

• Dedicated path for processor/memory

• Processor does not have to be involved in array-memory path

Garp Results

• General results

- 10-20x improvement on stream, feed-forward operation

- 2-3x when data dependencies limit pipelining

- [Hauser-FCCM97]

PRISC/Chimaera vs. Garp

• PRISC/Chimaera

- Basic op is single cycle: expfu

- No state

- Could have multiple PFUs

- Fine grained parallelism

- Not effective for deep pipelines

• Garp

- Basic op is multi-cycle – gaconfig

- Effective for deep pipelining

- Single array

- Requires state swapping consideration

Common Theme

• To overcome instruction expression limits:

- Define new array instructions. Makes decode hardware slower / more complicated.

- Many bits of configuration… swap time. An issue -> recall tips for dynamic reconfiguration.

• Give array configuration short “name” which processor can call out.

• Store multiple configurations in array. Access as needed (DPGA)

Observation

• All coprocessors have been single-threaded

- Performance improvement limited by application parallelism

• Potential for task/thread parallelism

- DPGA

- Fast context switch

• Concurrent threads seen in discussion of IO/stream processor

• Added complexity needs to be addressed in software.

FPGA Power Reduction Goals

• Dynamic power goals

- Reduce Vdd along non-critical paths

- Low swing signalling

- Use CAD approaches to limit long high-toggle paths

- $P_{dynamic} = 0.5 \cdot C \cdot V_{dd}^{2} \cdot f$

• Static power goals

- Cut-off Vdd for unused transistors

- Use high Vt transistors for SRAM cells

- Various other voltage biasing techniques
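
The dynamic power relation listed above (0.5 times C times Vdd squared times f) can be checked numerically. The following is a minimal sketch; the capacitance, voltage, and frequency values are hypothetical and chosen only to show the quadratic dependence on Vdd.

```python
def dynamic_power(c_farads: float, vdd_volts: float, f_hz: float) -> float:
    """Return 0.5 * C * Vdd^2 * f in watts."""
    return 0.5 * c_farads * vdd_volts ** 2 * f_hz


# Hypothetical example values (not taken from the lecture).
C = 1e-12      # switched capacitance in farads
F = 200e6      # toggle frequency in hertz

p_full = dynamic_power(C, 1.2, F)   # nominal supply
p_low = dynamic_power(C, 0.9, F)    # reduced supply on a non-critical path

# Ratio depends only on the voltage ratio squared: (0.9 / 1.2)^2
ratio = p_low / p_full
```

Lowering Vdd by 25% in this example cuts dynamic power to roughly 56% of its original value, which is why reducing Vdd along non-critical paths appears among the dynamic power goals.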

Traditional Routing Switch

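[Figure: traditional routing switch. A MUX with inputs i1, i2, i3, i4, ..., in is controlled by SRAM configuration cells (S) and followed by a level-restoring buffer (transistors MP1 and MP2, internal node VINT) that drives OUT.]

Courtesy: Anderson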

Proposed Switch Designs: Anderson

° Based on 3 observations:

• Routing switch inputs are tolerant to weak-1 signals (level-restoring buffers).

• Considerable slack in FPGA designs: many switches can be slowed down.

• Most routing switches feed other routing switches.

- Can produce weak-1 logic signals.

“Basic” Switch Design

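Mode operation:

• high-speed: MNX & MPX ON

• low-power: MNX ON, MPX OFF

• sleep: MNX OFF, MPX OFF

[Figure: “basic” switch. The SRAM-configured MUX (inputs i1, i2, i3, i4, ..., in) and its output buffer driving OUT are powered from a virtual supply VVD. VVD connects to VDD through transistors MNX and MPX, controlled by the signals ~SLEEP and LOW_POWER v SLEEP.]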

High-Speed Mode

• MNX & MPX ON

• VVD = VDD

• Output swing: rail-to-rail.

Low-Power Mode

• MNX ON, MPX OFF

• VVD = VDD - VTH

• Output swing: GND-to-(VDD-VTH).

Sleep Mode

• MNX OFF, MPX OFF
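
The three operating modes can be tabulated from the two control signals. The sketch below assumes that MNX is an NMOS device gated by ~SLEEP and MPX is a PMOS device gated by LOW_POWER v SLEEP; that pairing is an assumption, though it reproduces the mode table above. The voltages `VDD` and `VTH` are hypothetical values, and the sleep-mode VVD is left unspecified because the slides do not give it.

```python
VDD = 1.2   # hypothetical supply voltage in volts
VTH = 0.3   # hypothetical NMOS threshold voltage in volts


def switch_mode(sleep: bool, low_power: bool) -> dict:
    """Derive transistor states, mode name, and VVD from the controls."""
    mnx_gate = not sleep              # ~SLEEP
    mpx_gate = low_power or sleep     # LOW_POWER v SLEEP
    mnx_on = mnx_gate                 # assumed NMOS: conducts when gate is high
    mpx_on = not mpx_gate             # assumed PMOS: conducts when gate is low

    if mpx_on:
        # PMOS passes a full VDD, so the output swings rail-to-rail.
        mode, vvd = "high-speed", VDD
    elif mnx_on:
        # Only the NMOS conducts, so VVD sits one threshold below VDD.
        mode, vvd = "low-power", VDD - VTH
    else:
        # Both supply transistors off; VVD is not specified here.
        mode, vvd = "sleep", None
    return {"mode": mode, "mnx_on": mnx_on, "mpx_on": mpx_on, "vvd": vvd}


table = [switch_mode(s, lp) for s in (False, True) for lp in (False, True)]
```

Note that SLEEP overrides LOW_POWER in this model: with SLEEP asserted, both transistors are off regardless of the LOW_POWER setting.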

Leakage Power Results: Anderson

[Figure: bar chart for the basic switch design. Y-axis: % leakage power reduction vs. high-speed mode (0 to 70). Bars: LP mode, Sleep mode, LP mode (+unused fanout), LP mode (+used fanout), Traditional switch. Data labels in the figure: 36, 60.8, 39.7, 38.7, 0.3.]

FPGA Embedded Memory Blocks

° Embedded memory blocks (EMBs) are important parts of FPGAs

° Consume roughly 14% of Altera Stratix II dynamic power *

• Increasing in recent designs

* Stratix II Low Power Applications Note, 2005

Embedded Memory Block Port Internal View

[Figure: internal view of one EMB port. Labeled elements: Write Data, Write Enable, Read Enable, Address, Latch, Clk, Clk Enable, MClk, Row Decode, RAM cell with BIT lines, Bit Line Pre-charge, Column Mux, Write Buffers, Sense Amps, Read Data.]

Reducing clocking saves dynamic power

Power Optimization #1

° Convert EMB read enable/write enable signals to associated read/write clock enable signals

° Limitations

• Each port has read or write enable control signal

• Embedded memory block has read enable input

[Figure: “Before” and “After” schematics of an EMB with Clock, Data, Write Address, Read Address, Wren, Rden, and Q ports, showing how the Write enable / Read enable and Wr clk enable / Rd clk enable inputs are connected (including ties to Vcc) before and after the conversion.]

Implementation

° Conversion mode

• Ties off R/W enable to RAM clock enables

• Doesn’t make transform if CE already present on port

° Combining mode

• AND user RAM clock enables with derived R/W clock

• Could impact performance

[Figure: AND gate combining Write Enable and User-defined Write Clk Enable into Combined Write Clk Enable.]
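
The conversion and combining modes above can be sketched as a netlist-style rewrite on one memory port. This is an illustrative model only: the `Port` class, the `convert_enable` function, and the signal names are hypothetical, and an absent clock enable is modeled as `None` (that is, effectively tied to Vcc).

```python
from dataclasses import dataclass
from typing import Optional

VCC = "VCC"


@dataclass(frozen=True)
class Port:
    enable: str                         # net driving the R/W enable input
    clock_enable: Optional[str] = None  # user clock enable net, if any


def convert_enable(port: Port, combine: bool = False) -> Port:
    """Move the R/W enable onto the port's clock enable where allowed."""
    if port.enable == VCC:
        return port  # nothing to convert
    if port.clock_enable is None:
        # Conversion mode: enable drives the clock enable, enable tied off.
        return Port(enable=VCC, clock_enable=port.enable)
    if combine:
        # Combining mode: AND the user clock enable with the R/W enable.
        combined = f"({port.clock_enable} AND {port.enable})"
        return Port(enable=VCC, clock_enable=combined)
    # A clock enable is already present and combining is off: no transform.
    return port


converted = convert_enable(Port("wren"))
untouched = convert_enable(Port("wren", "user_ce"))
combined = convert_enable(Port("wren", "user_ce"), combine=True)
```

The extra AND gate in combining mode sits on the clock enable path, which is one way to read the note that combining could impact performance.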

FPGA RAM Processing

° FIFOs and Shift registers converted into logical RAMs

° Logical RAMs mapped to RAM blocks

[Flow: FIFO, Shift Register, RAM specification -> Create Logical Memory -> Logical RAMs/logic -> Logical-to-physical RAM processing -> RAM blocks/logic -> Memory/logic placement -> Placed Memory]

Mapping RAM to EMBs

° Implementation choice can impact design area, performance, and power.

° Some mappings may require multiple EMBs

[Figure: a user-defined (logical) memory, 4k deep x 4 wide (16K bits), and the physical (EMB) memory shown for it: four M4K blocks of 4K bits each, and a 512K MRAM.]

Memory Organization

° Each EMB can be configured to have different depth and width (e.g. Stratix II M4K)

° All hold 4K bits

° Slightly lower power consumption for wider EMB configurations (not including routing)

[Figure: example M4K configurations: 4K words deep x 1 bit wide, 512 words deep x 8 bits wide, 128 words deep x 32 bits wide.]
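
Since every configuration holds the same 4K bits, depth and width trade off directly. The short sketch below enumerates the combinations, assuming only power-of-two widths from 1 to 32; real devices may offer additional widths, so the list of widths is an assumption for illustration.

```python
EMB_BITS = 4096  # capacity of one M4K block, as stated above


def emb_configurations(widths=(1, 2, 4, 8, 16, 32)):
    """Return (depth, width) pairs that each hold EMB_BITS bits."""
    return [(EMB_BITS // w, w) for w in widths]


configs = emb_configurations()
for depth, width in configs:
    print(f"{depth:5d} words deep x {width:2d} bits wide")
```

The three configurations named on the slide (4K x 1, 512 x 8, 128 x 32) all appear in this list.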

Area and Delay Optimal Mapping

° Configure each EMB to be as deep as possible

° Number of address bits on each EMB same as on logical memory

° Area and performance efficient: no external logic needed

° Power inefficient: All EMBs must be active during each logical RAM access

[Figure: vertical slicing. A logical memory 4k words deep and 4 bits wide is built from four EMBs, each 4k words deep and 1 bit wide. Every EMB receives Addr[0:11] and supplies one bit of Data[0:3]. 4 EMBs active during access.]

Alternative Mapping

° Configure EMB to have width of logical RAM (e.g. 1Kx4)

• Allows shutdown of some RAMs each cycle

• But adds some logic

° Saves RAM power, adds combinational logic and register power

[Figure: horizontal slicing (more power efficient). A logical memory 4k words deep and 4 bits wide is built from four EMBs, each 1K deep x 4 wide. Addr[0:9] goes to the EMBs, Addr[10:11] goes to an address decoder with 4 outputs and to the selection of Data[0:3]. 1 EMB active during access.]
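
The difference between the two mappings for the 4k x 4 example can be sketched as an address decode. The code assumes Addr[0] is the least significant bit, so Addr[0:9] is the word address inside an EMB and Addr[10:11] selects the EMB; the function names are hypothetical.

```python
NUM_EMBS = 4
LOCAL_BITS = 10  # Addr[0:9] addresses 1K words inside one EMB


def horizontal_access(addr: int):
    """Return (selected EMB, local address, per-EMB enables)."""
    if not 0 <= addr < 4096:
        raise ValueError("address out of range for a 4k-deep memory")
    local = addr & ((1 << LOCAL_BITS) - 1)   # Addr[0:9]
    select = addr >> LOCAL_BITS              # Addr[10:11]
    enables = [i == select for i in range(NUM_EMBS)]
    return select, local, enables


def vertical_access(addr: int):
    """Every 4k x 1 EMB supplies one data bit, so all are enabled."""
    if not 0 <= addr < 4096:
        raise ValueError("address out of range for a 4k-deep memory")
    return [True] * NUM_EMBS


sel, local, enables = horizontal_access(2500)
active_horizontal = sum(enables)             # one EMB active
active_vertical = sum(vertical_access(2500)) # four EMBs active
```

The decoder and output selection in `horizontal_access` are the added logic that the alternative mapping pays for in exchange for shutting down three of the four RAMs each cycle.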

RAM Slicing - Example

° Power reduction available with different slicing

[Figure: 4kx32 dynamic power. Y-axis: dynamic power (mW), 0 to 140. X-axis: maximum depth (128, 256, 512, 1k, 2k, 4k). Annotations: “Multiplexer Power Increasing”, “Best range”, “EMB Power Increasing”.]
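
The trade-off behind this example can be illustrated by counting, for each maximum EMB depth, how many EMBs are active per access and how many ways the output must be multiplexed. This sketch reports structural counts only and does not model power; the function name and dictionary keys are hypothetical, and a 4K-bit EMB is assumed.

```python
EMB_BITS = 4096  # assumed capacity of one EMB


def slicing_options(depth=4096, width=32,
                    max_depths=(128, 256, 512, 1024, 2048, 4096)):
    """Count active EMBs and mux inputs for each maximum EMB depth."""
    options = []
    for d in max_depths:
        emb_width = EMB_BITS // d                 # bits per word in one EMB
        active_embs = -(-width // emb_width)      # EMBs side by side (ceil)
        mux_inputs = -(-depth // d)               # depth slices (ceil)
        options.append({
            "max_depth": d,
            "emb_width": emb_width,
            "active_embs": active_embs,
            "mux_inputs": mux_inputs,
            "total_embs": active_embs * mux_inputs,
        })
    return options


options = slicing_options()
for o in options:
    print(o)
```

Under these assumptions the total EMB count stays at 32 for every option, while shallower slices activate fewer EMBs per access but need a wider output multiplexer, consistent with the two opposing trends annotated in the figure.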

Power Optimization #2: Power-aware RAM Partitioning

° Algorithm considers possible logical to physical RAM mappings

[Flow diagram boxes: FIFO, Shift Register; Create Logical Memory; Power-aware Physical RAM processing (with Power Library); Insert Decode and Mux Logic; Memory/Logic Placement; Completed placement.]

Experimental Approach

° 40 designs evaluated

° Quartus 5.1

° Mapped to smallest possible device and target max frequency

° Simulation with test vectors

° Power analysis with PowerPlay

Memory Power

° 21.0% average reduction for all techniques (9.7% with convert/combine)

[Figure: % dynamic power reduction (axis -10 to 80) per design for the 40 designs. Series: Enable convert/combine; Enable convert/combine + Mem partition.]

Overall Core Dynamic Power

° 6.8% average power reduction for all techniques (2.6% with convert/combine)

[Figure: % dynamic power reduction (axis -5 to 35) per design for the 40 designs. Series: Enable convert/combine; Enable convert/combine + Mem partition.]

Design Performance

° 1.0% average performance loss for all techniques (0.1% for enable convert/combine)

[Figure: Average Design Clock Frequency. Y-axis: % frequency improvement (-30 to 10). X-axis: designs. Series: Enable convert/combine; Enable convert/combine + Mem partition.]

Results Summary

° Almost 7% core dynamic power reduction across all designs

• Some designs benefit more than others

° Minimal clock frequency hit for most designs

                        Enable     Enable            Enable convert/
                        convert    convert/combine   combine + Mem partition
Core dynamic power      -1.8%      -2.6%             -6.8%
Memory dynamic power    -6.3%      -9.7%             -21.0%
Max clk freq            -0.1%      -0.2%             -1.0%
LUT count                0.0%       0.1%              0.7%

Other material

° Lecture 17: Reconfigurable Memory Security

° Lecture 18: Hardware Monitors to Protect Network Processors

° Lecture 19 is not covered on the exam