L6: Lower Power Architecture Design

SungKyunKwan Univ., VADA Lab. L6: Lower Power Architecture Design. 1999. 8.2. Prof. Jun-Dong Cho (조준동), Sungkyunkwan University. http://vada.skku.ac.kr


Transcript of L6: Lower Power Architecture Design

Page 1: L6: Lower Power Architecture Design

SungKyunKwan Univ., VADA Lab.

L6: Lower Power Architecture Design

1999. 8.2

Prof. Jun-Dong Cho (조준동), Sungkyunkwan University
http://vada.skku.ac.kr

Page 2: L6: Lower Power Architecture Design


Through WAVE PIPELINING

Page 3: L6: Lower Power Architecture Design


Wave-pipelining on FPGA

• Problems with conventional pipelining
– Balanced partitioning
– Delay element overhead
– Tclk > Tmax - Tmin + clock skew + setup/hold time
– Increased area, power, and total latency
– Clock distribution problem
• Wave-pipelining = high throughput without such overhead = ideal pipelining
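The clock constraint listed above can be evaluated numerically. The sketch below is a minimal illustration only: the function names and the sample delay values are assumptions chosen for the example, not figures from the slides.

```python
def min_wave_clock_period(t_max_ns, t_min_ns, skew_ns=0.0, setup_hold_ns=0.0):
    """Lower bound on Tclk from: Tclk > Tmax - Tmin + clock skew + setup/hold time."""
    if t_min_ns > t_max_ns:
        raise ValueError("t_min_ns must not exceed t_max_ns")
    return (t_max_ns - t_min_ns) + skew_ns + setup_hold_ns


def max_frequency_mhz(period_ns):
    """Convert a clock period in ns to a frequency in MHz."""
    return 1000.0 / period_ns


# Illustrative numbers only (hypothetical, not taken from the slides).
bound = min_wave_clock_period(t_max_ns=70.0, t_min_ns=52.0,
                              skew_ns=1.0, setup_hold_ns=1.0)
print(f"Tclk must exceed {bound:.1f} ns, i.e. stay below {max_frequency_mhz(bound):.1f} MHz")
```

The bound depends on the spread Tmax - Tmin rather than on Tmax itself, which is why balancing path delays matters more than shortening the longest path.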

Page 4: L6: Lower Power Architecture Design


FPGA on Wave Pipeline

• LUT delay is similar across different logic functions.
• Equal path delays can therefore be constructed.
• FPGA element delays (wire, LUT, interconnection)
• Powerful layout editor
• Fast design cycle

Page 5: L6: Lower Power Architecture Design


WP advantages

• Area efficient: no registers, clock distribution network, or clock buffers are required.
• Low power dissipation
• Higher throughput
• Low latency

Page 6: L6: Lower Power Architecture Design


Disadvantage

• Degraded performance in certain cases
• Difficult to achieve sharp rise and fall times in synchronous design
• Layout is critical for balancing the delay
• Parameter variation: power supply and temperature dependence

Page 7: L6: Lower Power Architecture Design


Experimental Results

                   Conventional   Pipeline                     Wave pipeline
Register           0              286                          28
Max. path delay    74.188 ns      12.730 ns                    68.969 ns
Min. path delay    9.0 ns, 52.356 ns (only two values given; column of the 9.0 ns entry unclear)
Max. freq.         13.5 MHz       78.6 MHz                     50 MHz
CLB #              49             143                          148
Latency            75 ns          169 ns (13 clk)              80 ns
Power              19.6 mW/MHz    76.8 mW/MHz + clock driver   64.8 mW/MHz

By Lee Jae-hyung (이재형), SKKU
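The experimental figures above can be cross-checked with a little arithmetic. The sketch below copies the power and max path delay values from the slide; the helper names are assumptions for illustration. It uses the identity 1 mW/MHz = 1 nJ per clock cycle, and it leaves out the pipeline's clock driver power because the slide gives no number for it.

```python
# Figures copied from the Experimental Results slide.
POWER_MW_PER_MHZ = {
    "conventional": 19.6,
    "pipeline": 76.8,       # slide notes "+ clock driver", which is not included here
    "wave pipeline": 64.8,
}
MAX_PATH_DELAY_NS = {"conventional": 74.188, "pipeline": 12.730}


def fmax_mhz(max_path_delay_ns):
    """Clock limit when the period must cover the longest path: 1 / Tmax."""
    return 1000.0 / max_path_delay_ns


def energy_per_cycle_nj(mw_per_mhz):
    """1 mW/MHz = 1e-3 W / 1e6 Hz = 1e-9 J, so the number carries over as nJ."""
    return mw_per_mhz * 1.0


def relative_saving(baseline, candidate):
    """Fraction by which candidate is lower than baseline."""
    return (baseline - candidate) / baseline


for name, delay in MAX_PATH_DELAY_NS.items():
    print(f"{name}: fmax is about {fmax_mhz(delay):.1f} MHz")

saving = relative_saving(POWER_MW_PER_MHZ["pipeline"], POWER_MW_PER_MHZ["wave pipeline"])
print(f"wave pipeline vs pipeline: about {saving:.0%} less energy per cycle")
```

The conventional and pipelined frequencies follow from 1 / Tmax. The wave pipeline's 50 MHz does not, since its clock is bounded by the delay spread rather than by the longest path. The energy saving over the pipelined design comes out at roughly 16% before the clock driver is counted.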

Page 8: L6: Lower Power Architecture Design


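Observation

• The WP multiplier needed many additional LUTs to adjust delays, so it showed no large gain in power consumption.
• On an FPGA, adjusting delays with dedicated delay elements, rather than with LUTs or net delay, would be more effective.
• In addition, designing a multiplier whose paths all have the same number of logic levels would make the WP implementation easier and should yield large gains in power and area over the pipelined structure.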

Page 9: L6: Lower Power Architecture Design


VON NEUMANN VERSUS HARVARD

Page 10: L6: Lower Power Architecture Design


Power vs Area of Micro-coded Microprocessor

At 1.5 V and a 10 MHz clock rate, instruction and data memory accesses account for 47% of the total power consumption.

Page 11: L6: Lower Power Architecture Design


Memory Architecture

Page 12: L6: Lower Power Architecture Design


Exploiting Locality for Low-Power Design

• Power consumption (mW) in the maximally time-shared and fully-parallel versions of the QMF sub-band coder filter
• Improvement of a factor of 10.5 at the expense of a 20% increase in area
• The interconnect elements (buses, multiplexers, and buffers) consume 43% and 28% of the total power in the time-shared and parallel versions, respectively.
• A spatially local cluster: a group of algorithm operations that are tightly connected to each other in the flow graph representation.
• Two nodes are tightly connected to each other in the flow graph representation if the shortest distance between them, in terms of number of edges traversed, is low.
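The "tightly connected" criterion can be sketched as a breadth-first search over the flow graph. Everything named here is an assumption for illustration: the slide does not say how low the distance must be, so `max_edges` is a hypothetical threshold, edge direction is ignored, and the example graph is invented.

```python
from collections import deque


def edge_distance(graph, src, dst):
    """Shortest number of edges between src and dst, ignoring edge direction."""
    # Build an undirected adjacency map so the distance is symmetric.
    adj = {}
    for u, succs in graph.items():
        adj.setdefault(u, set())
        for v in succs:
            adj[u].add(v)
            adj.setdefault(v, set()).add(u)
    if src not in adj or dst not in adj:
        return None
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two nodes are not connected


def tightly_connected(graph, a, b, max_edges=2):
    """True if a and b are within max_edges edges (hypothetical threshold)."""
    d = edge_distance(graph, a, b)
    return d is not None and d <= max_edges


# Hypothetical flow graph: multiply/add operations feeding a common output.
flow_graph = {
    "m1": ["a1"], "a1": ["a2"], "m2": ["a2"],
    "a2": ["out"], "m3": ["a3"], "a3": ["out"],
}
print(edge_distance(flow_graph, "m1", "a2"), tightly_connected(flow_graph, "m1", "a2"))
print(edge_distance(flow_graph, "m1", "m3"), tightly_connected(flow_graph, "m1", "m3"))
```

In this example `m1` and `a2` are two edges apart and would fall into the same cluster, while `m1` and `m3` are five edges apart and would not.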

Page 13: L6: Lower Power Architecture Design


Cascade filter layouts

(a) Non-local implementation from Hyper
(b) Local implementation from Hyper-LP

Page 14: L6: Lower Power Architecture Design


Frequency Multipliers and Dividers

Page 15: L6: Lower Power Architecture Design


Low Power DSP

• Instruction Buffer (or Cache): exploits locality to reduce program memory accesses.
• Decoded Instruction Buffer
– The decoding result of the first iteration of a LOOP is stored in RAM and then reused
– Eliminates the fetch/decode steps
– 30 to 40% power saving

Page 16: L6: Lower Power Architecture Design


Stage-Skip Pipeline

• The power saving is achieved by stopping the instruction fetch and decode stages of the processor during loop execution, except for the first iteration.
• DIB = Decoded Instruction Buffer
• 40% power savings using a DSP or RISC processor.

Page 17: L6: Lower Power Architecture Design


Stage-Skip Pipeline

• Selector: selects the output from either the instruction decoder or the DIB.
• The decoded instruction signals for a loop are temporarily stored in the DIB and are reused in each iteration of the loop.
• The power wasted in the conventional pipeline is saved in our pipeline by stopping the instruction fetching and decoding for each loop execution.
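The selector and DIB behaviour described above can be sketched as a small activity counter. This is a toy model under stated assumptions: the function name, the loop parameters, and the tuple used as a decoded instruction are all hypothetical, and the model counts stage activations only. It does not reproduce the 40% power figure, which comes from the slides.

```python
def count_stage_activity(loop_body_len, iterations, use_dib):
    """Count fetch, decode and execute activations for one loop."""
    fetches = decodes = executes = 0
    dib = []  # decoded instruction buffer, filled during the first iteration
    for it in range(iterations):
        for pc in range(loop_body_len):
            if use_dib and it > 0:
                decoded = dib[pc]          # selector picks the DIB output
            else:
                fetches += 1               # instruction fetch stage is active
                decoded = ("decoded", pc)  # stand-in for the decoder output
                decodes += 1               # decode stage is active
                if use_dib:
                    dib.append(decoded)    # keep the decoded signals for reuse
            assert decoded is not None     # execute stage consumes the decoded signals
            executes += 1
    return {"fetch": fetches, "decode": decodes, "execute": executes}


conventional = count_stage_activity(loop_body_len=8, iterations=100, use_dib=False)
stage_skip = count_stage_activity(loop_body_len=8, iterations=100, use_dib=True)
print("conventional:", conventional)
print("stage-skip:  ", stage_skip)
```

With an 8-instruction loop body run 100 times, fetch and decode fire 800 times each in the conventional case and only 8 times each with the DIB, while the execute count is unchanged.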

Page 18: L6: Lower Power Architecture Design


Stage-Skip Pipeline

The majority of execution cycles in signal processing programs are used for loop execution: 40% reduction in power with a 2% increase in area.

Page 19: L6: Lower Power Architecture Design


Two’s complement implementation of an accumulator

Page 20: L6: Lower Power Architecture Design


Sign magnitude implementation of an accumulator

Page 21: L6: Lower Power Architecture Design


Number representation trade-off for arithmetic

Page 22: L6: Lower Power Architecture Design


Signal statistics for Sign Magnitude implementation of the accumulator datapath assuming random inputs.
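The number representation trade-off can be made concrete by counting bit transitions for the same data in both encodings. The sketch below is an illustration under assumptions: it looks only at the switching of the data words themselves, not at the accumulator circuits in the figures, and the sample sequence and 8-bit word length are hypothetical choices.

```python
def twos_complement(value, bits):
    """Encode a signed integer as a two's complement bit pattern."""
    if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
        raise ValueError("value out of range")
    return value & ((1 << bits) - 1)


def sign_magnitude(value, bits):
    """Encode a signed integer as sign bit plus magnitude."""
    if abs(value) >= (1 << (bits - 1)):
        raise ValueError("value out of range")
    sign = 1 if value < 0 else 0
    return (sign << (bits - 1)) | abs(value)


def toggles(words):
    """Total number of bit positions that change between consecutive words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


# Illustrative sequence: small values alternating around zero.
samples = [1, -1, 2, -2, 1, -1]
bits = 8
tc = toggles([twos_complement(v, bits) for v in samples])
sm = toggles([sign_magnitude(v, bits) for v in samples])
print(f"two's complement toggles: {tc}, sign magnitude toggles: {sm}")
```

For small values that change sign often, two's complement flips most of the upper bits on every sign change, while sign magnitude flips only the sign bit and the few low-order magnitude bits that differ. The cost on the sign magnitude side is more complex arithmetic hardware.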