Device and Architecture Co-Optimization for FPGA Power Reduction

Lerong Cheng, Phoebe Wong, Fei Li, Yan Lin, and Prof. Lei He

EE Department, UCLA

Partially supported by NSF CAREER award CCR-0093273/0401682 and NSF grant CCR-0306682.

Address comments to [email protected]

Outline

Background and motivation

Trace-based power and delay estimation

Device and architecture co-optimization

Conclusion

Evaluation of Conventional FPGA Architecture

LUT size and cluster size have been evaluated for conventional FPGAs in terms of:
  performance and area [Ahmed et al, ISFPGA’00]
  power and performance [Li et al, ISFPGA’03]

Architecture tuning leads to a 2.8X energy difference and a 1.5X delay difference.

[Figure: island-style FPGA architecture, showing logic blocks, I/O pads, switch boxes, and connection boxes.]

[Figure: evaluation result. Total FPGA energy (nJ/cycle) versus critical path delay (ns), one point per architecture, labeled (N, K) for cluster sizes N = 6, 8, 10, 12 and LUT sizes K = 3 to 7.]

Evaluation of Low-Power FPGA Architecture

Field programmable dual-Vdd for power reduction [Lin et al, ISFPGA’05]

Applying field programmable dual-Vdd reduces the energy-delay product by 49%.

[Figure: a conventional FPGA next to a Vdd-programmable FPGA, whose blocks are marked as high-Vdd logic blocks, low-Vdd logic blocks, or Vdd-programmable logic blocks.]

Evaluation Methodology

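[Figure: evaluation flow. Benchmark circuits pass through logic optimization (SIS), technology mapping (RASP), timing-driven packing (T-VPack), and placement and routing (VPR) under a given architecture specification. VPR reports delay and area; after parasitic extraction, the cycle-accurate power simulator (Psim) reports power.]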

Impact of Device Tuning

All the previous work only considers architecture tuning

Device tuning leads to 84X power difference and 12X delay difference

It is necessary to perform device tuning and architecture tuning simultaneously

Challenge of Device and Architecture Co-Optimization

We consider the following architecture and device parameters during our co-optimization:

Architecture parameters:
  Cluster size (N)
  LUT size (K)

Device parameters:
  Supply voltage (Vdd)
  Threshold voltage (Vt)

Hyper-architecture (hyper-arch) is the combination of the device and architecture parameters.

Large number of hyper-arch combinations

VPR and Psim are too slow to handle such a large number of experiments

Need fast yet accurate power and delay estimation

Outline

Background and motivation

Trace-based power and delay estimation
  Trace collection
  Trace-based power and delay model
  Accuracy and efficiency verification of the trace-based estimator

Device and architecture co-optimization

Conclusion

Trace Collection

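[Figure: trace collection. VPR and Psim produce the area and the trace, which is the input to Ptrace.]

The trace contains:
  Short circuit power ratio
  Circuit element statistics
  Switching activity
  Critical path structure

The trace information is assumed to remain the same when the device setting changes.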

Trace-Based Estimation (Ptrace) Framework

[Figure: Ptrace framework. Ptrace combines the trace (device independent) with circuit-level delay and power (device dependent) to produce chip-level delay, power, and area.]

Delay Model in VPR

Delay is calculated for each path p as

$$D_p = \sum_i N_i^p D_i$$

where $N_i^p$ is the number of type i elements in the path and $D_i$ is the delay of a type i element.

Delay of the logic elements is measured by SPICE simulation.
Elmore delay is used for interconnect wire segments.

The critical path is the path with the longest delay.

Delay in Ptrace

Obtain the path structure of a set of longest circuit paths.

Assume that when the device setting changes, the new critical path is still among the set of longest paths.

Delay computation:

$$D_p = \sum_i N_i^p D_i$$

Here $N_i^p$ is trace information and $D_i$ is a device dependent parameter.
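The Ptrace delay computation can be sketched as follows: each stored path is a count of elements per type (trace information), each device setting supplies a per-type delay (device dependent), and the critical path delay is the maximum of the weighted sums over the stored paths. The element type names and all numbers below are hypothetical and chosen only so that the critical path switches between the two settings; this is an illustration of the formula, not the Ptrace implementation.

```python
from typing import Dict, List


def path_delay(element_counts: Dict[str, int], element_delay: Dict[str, float]) -> float:
    """Delay of one path: sum over element types of (count in path) * (delay per element)."""
    return sum(count * element_delay[kind] for kind, count in element_counts.items())


def critical_path_delay(paths: List[Dict[str, int]], element_delay: Dict[str, float]) -> float:
    """Longest delay among the stored candidate paths for one device setting."""
    return max(path_delay(path, element_delay) for path in paths)


# Hypothetical trace: structure of two of the longest paths (device independent).
paths = [
    {"lut": 4, "wire_segment": 10, "switch_buffer": 6},
    {"lut": 6, "wire_segment": 5, "switch_buffer": 3},
]

# Hypothetical per-element delays in ns for two device settings (device dependent).
delay_setting_a = {"lut": 0.5, "wire_segment": 0.2, "switch_buffer": 0.1}
delay_setting_b = {"lut": 0.9, "wire_segment": 0.1, "switch_buffer": 0.05}

crit_a = critical_path_delay(paths, delay_setting_a)  # first path is critical
crit_b = critical_path_delay(paths, delay_setting_b)  # second path is critical
print(crit_a, crit_b)
```

Keeping several candidate paths in the trace is what lets the estimate follow a change of critical path without rerunning placement and routing.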

Dynamic Power Model

Psim

Switch power:

$$P_{sw} = \sum_{i=1}^{n} \frac{1}{2} f V_{dd}^2 C_i S_i$$

Switching activity $S_i$ is measured by timing simulation for each node.

Short circuit power:

$$P_{sc} = P_{sw} \cdot \alpha_{sc}(t_r)$$

$\alpha_{sc}$ is calculated for each node.

Ptrace

Switch power:

$$P_{sw} = \sum_{i=1}^{n} \frac{1}{2} f V_{dd}^2 C_i S_i$$

$S_i$ is the average switching activity (trace information).

Short circuit power:

$$P_{sc} = P_{sw} \cdot \alpha_{sc}$$

$\alpha_{sc}$ is the average short circuit power ratio for the whole circuit.
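A minimal sketch of the Ptrace-style dynamic power calculation: switch power is the sum of one half times frequency, Vdd squared, capacitance, and average switching activity, and short circuit power is switch power scaled by a single average ratio. The function names and all numeric values are hypothetical, and adding the two terms to get total dynamic power is an assumption of this sketch.

```python
from typing import Sequence


def switching_power(freq_hz: float, vdd: float,
                    caps_farad: Sequence[float], activities: Sequence[float]) -> float:
    """P_sw = sum_i 0.5 * f * Vdd^2 * C_i * S_i."""
    return sum(0.5 * freq_hz * vdd ** 2 * c * s for c, s in zip(caps_farad, activities))


def short_circuit_power(p_sw: float, alpha_sc: float) -> float:
    """P_sc = P_sw * alpha_sc, with one average ratio for the whole circuit."""
    return p_sw * alpha_sc


# Hypothetical inputs: 100 MHz clock, Vdd = 0.9 V, two element types.
caps = [1e-12, 2e-12]    # switched capacitance per entry (F)
activity = [0.5, 0.25]   # average switching activity per entry (trace information)

p_sw = switching_power(100e6, 0.9, caps, activity)
p_sc = short_circuit_power(p_sw, alpha_sc=0.1)
p_dynamic = p_sw + p_sc  # assumption: dynamic power is the sum of both terms
print(p_sw, p_sc, p_dynamic)
```

Only `vdd` and the capacitances change with the device setting here; the activity values and the short circuit ratio are reused from the trace.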

Static Power Model

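Psim

Without power gating:

$$P_{static} = \sum_i N_i^t P_i$$

With power gating:

$$P_{static} = \sum_i N_i^u P_i + \alpha_{gating} \sum_i (N_i^t - N_i^u) P_i$$

Ptrace

Without power gating:

$$P_{static} = \sum_i N_i^t P_i$$

With power gating:

$$P_{static} = \sum_i N_i^u P_i + \alpha_{gating} \sum_i (N_i^t - N_i^u) P_i$$

Here $N_i^t$ is the total number of type i elements, $N_i^u$ is the number of used type i elements, $P_i$ is the static power of a type i element, and $\alpha_{gating}$ is the leakage ratio of a power-gated element. In Ptrace, the element counts $N_i^t$ and $N_i^u$ are trace information.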
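The static power model can be sketched as follows, assuming that the two counts per element type are the total and the used number of elements, and that power gating leaves unused elements leaking a fraction `alpha_gating` of their normal leakage. Element names and values are hypothetical.

```python
from typing import Dict, Optional


def static_power(total: Dict[str, int], leak_w: Dict[str, float],
                 used: Optional[Dict[str, int]] = None,
                 alpha_gating: Optional[float] = None) -> float:
    """Static power in watts.

    Without gating: every element leaks, so P = sum_i N_total_i * P_i.
    With gating: used elements leak fully, unused ones leak alpha_gating * P_i.
    """
    if used is None or alpha_gating is None:
        return sum(total[k] * leak_w[k] for k in total)
    used_part = sum(used[k] * leak_w[k] for k in total)
    unused_part = sum((total[k] - used[k]) * leak_w[k] for k in total)
    return used_part + alpha_gating * unused_part


# Hypothetical element statistics (trace information) and per-element leakage.
total_counts = {"lut": 100, "switch_buffer": 400}
used_counts = {"lut": 60, "switch_buffer": 100}
leakage = {"lut": 2e-6, "switch_buffer": 1e-6}

p_no_gating = static_power(total_counts, leakage)
p_gated = static_power(total_counts, leakage, used_counts, alpha_gating=0.01)
print(p_no_gating, p_gated)
```

With low interconnect utilization, as in this made-up example, most of the saving comes from the unused switch buffers.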


Experiment Setting

Collect trace using ITRS 70nm technology, but apply to both 100nm and 70nm technologies

20 MCNC benchmarks

Assume each benchmark runs at its highest possible frequency

Power and delay are computed as the geometric mean over the 20 benchmarks.
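As a small illustration of the averaging step, the geometric mean of per-benchmark values can be computed in log space. The three input values are hypothetical placeholders, not benchmark results.

```python
import math
from typing import Sequence


def geometric_mean(values: Sequence[float]) -> float:
    """n-th root of the product of n positive values, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-benchmark delays in ns.
delays_ns = [1.0, 4.0, 16.0]
mean_delay = geometric_mean(delays_ns)
print(mean_delay)
```

Unlike the arithmetic mean, the geometric mean keeps one very large benchmark from dominating the summary figure.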

Evaluation range

Vdd (V)    Vt (V)     LUT size (K)   Cluster size (N)
0.8~1.1    0.2~0.4    3~7            6~12
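To give a feel for the size of the search space, the sketch below enumerates hyper-archs over the evaluation range. The step sizes (0.1 V for Vdd, 0.05 V for Vt, even cluster sizes only) are assumptions made for illustration; the slides give only the ranges.

```python
from itertools import product

# Assumed sampling of the evaluation range (step sizes are not from the slides).
VDD_VALUES = [0.8, 0.9, 1.0, 1.1]
VT_VALUES = [0.20, 0.25, 0.30, 0.35, 0.40]
LUT_SIZES = [3, 4, 5, 6, 7]
CLUSTER_SIZES = [6, 8, 10, 12]

# Homogeneous Vt: one Vt for the whole chip.
hyper_archs = list(product(VDD_VALUES, VT_VALUES, LUT_SIZES, CLUSTER_SIZES))
homo_count = len(hyper_archs)

# Heterogeneous Vt: separate Vt for logic blocks and for interconnect.
hetero_count = (len(VDD_VALUES) * len(VT_VALUES) ** 2
                * len(LUT_SIZES) * len(CLUSTER_SIZES))

print(homo_count, hetero_count)
```

Even under this coarse sampling there are hundreds to thousands of combinations per benchmark, which is why a fast estimator is needed in place of full placement, routing, and simulation runs.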

Accuracy

Average power error is 3.4%.

Average delay error is 6.4%. The delay error arises because Ptrace ignores the impact of path branches, which VPR considers.

Runtime

VPR and Psim for one device setting: five days on eight 1.2GHz Intel Xeon servers

Ptrace for 20 device settings: 80 seconds on one 1.2GHz Intel Xeon server

Outline

Background and motivation

Trace-based power and delay estimation

Device and architecture co-optimization
  Energy and delay tradeoff
  ED and area tradeoff
  Comparison between classes
  Comparison between device tuning and architecture tuning

Conclusion

Architecture Classes to Be Evaluated

Hyper-architecture classes

Hyper-arch class   Vt
Homo-Vt            Homogeneous Vt
Hetero-Vt          Heterogeneous Vt
Homo-Vt+G          Homogeneous Vt + power gating
Hetero-Vt+G        Heterogeneous Vt + power gating

Baseline case
  Vdd suggested by ITRS
  Architecture same as Xilinx Virtex-II™
  Vt optimized by our method with respect to the above architecture and Vdd

Vdd (V)   Vt (V)   LUT size (K)   Cluster size (N)
0.9       0.3      4              8

Energy and Delay Tradeoff

Dominant hyper-arch
  Hyper-arch B is inferior to A if A has less energy and smaller delay than B.
  Dominant hyper-archs (dom-archs) are the hyper-archs that are NOT inferior to any other hyper-arch.
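The dominance definition translates directly into a filter over (energy, delay) points. The sketch below applies it to three hypothetical hyper-archs; the names and numbers are made up for illustration.

```python
from typing import List, NamedTuple


class HyperArch(NamedTuple):
    name: str
    energy_nj: float
    delay_ns: float


def is_inferior(b: HyperArch, a: HyperArch) -> bool:
    """B is inferior to A if A has both less energy and smaller delay than B."""
    return a.energy_nj < b.energy_nj and a.delay_ns < b.delay_ns


def dominant_archs(archs: List[HyperArch]) -> List[HyperArch]:
    """Keep the hyper-archs that are not inferior to any other hyper-arch."""
    return [b for b in archs if not any(is_inferior(b, a) for a in archs)]


# Hypothetical evaluation results.
candidates = [
    HyperArch("X", energy_nj=1.0, delay_ns=20.0),
    HyperArch("Y", energy_nj=1.2, delay_ns=18.0),
    HyperArch("Z", energy_nj=1.3, delay_ns=21.0),  # worse than X on both axes
]

dom = dominant_archs(candidates)
print([a.name for a in dom])
```

The pairwise check is quadratic in the number of hyper-archs, which is acceptable for a few thousand points; a sort-based sweep would scale better for larger sets.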

[Figure: energy per cycle (nJ) versus delay (ns) for the Homo-Vt, Hetero-Vt, Homo-Vt+G, and Hetero-Vt+G classes.]

Energy and Delay Tradeoff

Hetero-Vt can reduce power

Power gating reduces more leakage power than hetero-Vt

Hetero-Vt has less impact when power gating is applied

Min-ED Hyper-Arch

Hyper-arch class   Vdd (V)   CVt (V)   IVt (V)   (N, K)     ED (nJ·ns)   ED reduction (%)
Baseline           0.9       0.3       0.3       (8, 4)     26.9         -
Homo-Vt            0.9       0.3       0.3       (6, 7)     23.3         13.4
Hetero-Vt          0.9       0.2       0.25      (8, 4)     21.4         20.5
Homo-Vt+G          0.9       0.25      0.25      (12, 4)    11.1         58.9
Hetero-Vt+G        0.9       0.2       0.25      (8, 4)     11           59.0

To achieve the best energy and delay tradeoff, we find the hyper-arch with the minimum energy-delay product (ED).

Compared to the baseline, the min-ED hyper-arch of the conventional FPGA (Homo-Vt) reduces ED by 13.4%.

For the Hetero-Vt class, ED is reduced by 20.5%.

If power gating is applied, ED can be reduced by up to 59.0%.

ED and Area Tradeoff

Architecture tuning has great impact on area.

To achieve the best area and ED tradeoff, we find the hyper-arch with the minimum product of area, energy and delay (AED)

ED Area Tradeoff for Classes without Power Gating

Compared to the min-ED hyper-arch, the min-AED hyper-arch significantly reduces area with a small ED increase

[Figure: normalized area versus normalized ED for the Homo-Vt and Hetero-Vt classes, with the min-AED hyper-arch for Class 1 marked.]

Labeled hyper-archs, given as {Vdd, CVt, IVt, N, K}:
A1-1: {0.9, 0.3, 0.3, 6, 7}
A1-2: {1.0, 0.3, 0.3, 6, 4}
A1-3: {0.9, 0.3, 0.3, 12, 4}
A2-1: {0.9, 0.3, 0.25, 8, 5}
A2-2: {0.9, 0.3, 0.25, 12, 4}

Sleep Transistor Size Tuning

When power gating is applied, sleep transistors may increase area.

The larger the sleep transistor size, the smaller the delay.

Sleep transistor size tuning:
  The area overhead introduced by the sleep transistors of logic blocks is negligible.
  We consider 2X, 4X, 7X, and 10X PMOS as the sleep transistor for the switch buffer.

ED Area Tradeoff for Classes with Power Gating

The area reduction achieved by device and architecture co-optimization compensates for the area overhead introduced by sleep transistors

[Figure: normalized area versus normalized ED for the Homo-Vt+G and Hetero-Vt+G classes.]

Labeled hyper-archs, given as {Vdd, CVt, IVt, N, K, sleep transistor size}:
A3-1: {0.9, 0.25, 0.25, 12, 4, G2}
A3-2: {0.9, 0.25, 0.25, 12, 4, G4}
A3-3: {0.9, 0.25, 0.25, 12, 4, G7}
A3-4: {0.9, 0.25, 0.25, 12, 4, G10}
A4-1: {0.9, 0.2, 0.25, 12, 4, G2}
A4-2: {0.9, 0.2, 0.25, 12, 4, G4}
A4-3: {0.9, 0.2, 0.25, 10, 4, G4}
A4-4: {0.9, 0.2, 0.25, 6, 4, G7}
A4-5: {0.9, 0.2, 0.25, 12, 4, G7}
A4-6: {0.9, 0.2, 0.25, 8, 4, G10}



Min-AED Hyper-Arch

Hyper-arch class   Vdd (V)   CVt (V)   IVt (V)   (N, K)     Sleep transistor size   ED (nJ·ns)   Normalized area   AED reduction (%)
Baseline           0.9       0.30      0.30      (8, 4)     -                       26.9         1.00              -
Homo-Vt            1.0       0.30      0.30      (6, 4)     -                       23.6         0.80              30.0
Hetero-Vt          0.9       0.30      0.25      (12, 4)    -                       21.3         0.77              40.0
Homo-Vt+G          0.9       0.25      0.25      (12, 4)    2                       12.4         0.92              57.6
Hetero-Vt+G        0.9       0.20      0.25      (12, 4)    2                       12.2         0.92              58.3

Compared to the baseline, the min-AED hyper-arch in the conventional FPGA class can reduce area by 20% and ED by 12.3%

In the Hetero-Vt class, ED is reduced by 20.8% and area is reduced by 23% compared to the baseline

If power gating is applied, ED is reduced by 54.6% and area is reduced by 8.3%
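The min-AED selection can be reproduced from the tabulated ED and normalized area values. In the sketch below, the first power-gated row (equal CVt and IVt) is treated as Homo-Vt+G, which is an assumption about the table labels. Because the inputs are rounded table entries, recomputed percentages may differ from the reported ones in the last digit.

```python
from typing import Dict, Tuple

# (ED in nJ*ns, normalized area) per class, taken from the min-AED table.
# Assumption: the power-gated row with CVt = IVt = 0.25 is the Homo-Vt+G class.
MIN_AED: Dict[str, Tuple[float, float]] = {
    "Baseline": (26.9, 1.00),
    "Homo-Vt": (23.6, 0.80),
    "Hetero-Vt": (21.3, 0.77),
    "Homo-Vt+G": (12.4, 0.92),
    "Hetero-Vt+G": (12.2, 0.92),
}


def aed(name: str) -> float:
    """Area-energy-delay product: normalized area times ED."""
    ed, area = MIN_AED[name]
    return ed * area


def reduction_pct(value: float, baseline: float) -> float:
    """Percentage reduction of value relative to baseline."""
    return 100.0 * (1.0 - value / baseline)


best = min(MIN_AED, key=aed)
ed_red = reduction_pct(MIN_AED[best][0], MIN_AED["Baseline"][0])
aed_red = reduction_pct(aed(best), aed("Baseline"))
print(best, round(ed_red, 1), round(aed_red, 1))
```

The same two helper functions also cover the min-ED comparison if the area factor is fixed at 1.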

Comparison Between Classes in Similar Performance Range

Homo-Vt
Vdd (V)   Vt (V)   (N, K)     E (nJ)   D (ns)   ED (nJ·ns)
0.9       0.30     (6, 6)     1.33     18.6     24.8
0.9       0.3      (10, 5)    1.27     19.8     25
0.9       0.3      (6, 4)     1.23     21.6     26.5

Hetero-Vt
Vdd (V)   CVt (V)   IVt (V)   (N, K)     E (nJ)   D (ns)   ED (nJ·ns)
0.9       0.3       0.35      (6, 4)     1.16     20.1     23.3
0.9       0.3       0.35      (12, 4)    1.14     20.5     23.7
0.9       0.3       0.35      (8, 4)     1.09     22.1     24.1

Homo-Vt+G
Vdd (V)   Vt (V)   (N, K)     E (nJ)   D (ns)   ED (nJ·ns)
0.8       0.25     (10, 5)    0.70     19.4     13.7
0.8       0.25     (8, 4)     0.62     20.9     12.9
0.8       0.25     (12, 4)    0.62     21       12.9

Hetero-Vt+G
Vdd (V)   CVt (V)   IVt (V)   (N, K)     E (nJ)   D (ns)   ED (nJ·ns)
0.9       0.25      0.3       (12, 4)    0.66     18.9     12.5
0.8       0.25      0.25      (8, 4)     0.62     20.9     12.9
0.8       0.25      0.25      (12, 4)    0.62     21       12.9

Vt for logic block is lower than Vt for interconnect

Vt for classes with power gating is lower

Dom-Archs under Different Device Settings

[Figure: energy per cycle (nJ) versus delay (ns) for the dom-archs under eight device settings.]

Device settings:
D1: Vdd 0.9, Vt 0.25
D2: Vdd 0.9, Vt 0.30
D3: Vdd 0.9, Vt 0.35
D4: Vdd 1.0, Vt 0.25
D5: Vdd 1.0, Vt 0.30
D6: Vdd 1.0, Vt 0.35
D7: Vdd 1.1, Vt 0.30
D8: Vdd 1.1, Vt 0.35

For a given device setting, architecture tuning changes delay and energy over a smaller range.

Device tuning has a much greater impact on delay and energy.

Conclusion and Discussion

The trace-based estimator provides efficient and accurate FPGA power and delay estimation: the average power error is 3.4% and the average delay error is 6.1%.

Device and architecture co-optimization reduces ED by 20.5% and area by 23.3% when there is no power gating

With power gating, device and architecture co-optimization reduces ED by 54.6% and area by 8.3%

Device tuning has a more significant impact on delay and power than architecture tuning does

In recent research, Ptrace has been extended to consider leakage and timing yield with process variations
