On-Line Adjustable Buffering for Runtime Power Reduction ( vlsicad.ucsd )

UC San Diego Computer Engineering • VLSI CAD Laboratory

Conclusions & Ongoing Work

• Presented a new technique to dynamically trade off power and performance by turning off devices that are not needed at less than peak performance
• Both leakage and dynamic power reduce; total power reduction is 6-12% on our testcases
• By sharing LPM devices, area overhead is reduced to no more than 5.57%
• No adverse effect on circuit performance when the LPM signal is off

Ongoing work:
• Actual layout of the custom repeater, with routing of the V’DD, V’SS, and LPM nets, to accurately estimate power, performance, and area overhead
• Customizing more cells, especially clock repeaters, to further improve the power-performance trade-off

Restricting Area Overhead

Problem: High performance when LPM signal on → use large LPM devices → large area overhead
Solution: Share LPM devices among multiple repeaters
Fewer LPM devices, but virtual VDD (V’DD) and VSS (V’SS) need routing
Note: All LPM devices drive V’DD and V’SS

How many LPM devices needed?
• Compute simultaneous switching rate (SSR) by finding the max. #repeaters that have overlapping timing windows. Time = O(R log R) (R = #repeaters)
• Find total width of all repeater devices (= W_R)
• For good performance, width of LPM devices = 2 x SSR x W_R
Typical SSR = ~10% → small area overhead
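The sizing steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes each repeater's timing window is given as a half-open (start, end) pair, and it assumes SSR is expressed as the fraction of repeaters that can switch simultaneously (max. overlap divided by R), which matches the "~10%" figure. The function names are hypothetical. Sorting the window endpoints gives the O(R log R) running time.

```python
def max_overlap(windows):
    """Return the largest number of timing windows that overlap at any instant."""
    events = []
    for start, end in windows:
        events.append((start, 1))   # window opens
        events.append((end, -1))    # window closes
    # Sort by time; at equal times closings (-1) come before openings (+1),
    # so windows that merely touch are not counted as overlapping (assumption).
    events.sort(key=lambda e: (e[0], e[1]))
    active = best = 0
    for _, delta in events:
        active += delta
        best = max(best, active)
    return best


def lpm_width(windows, total_repeater_width):
    """Return (SSR, LPM device width), using width = 2 x SSR x W_R."""
    ssr = max_overlap(windows) / len(windows)  # SSR as a fraction (assumption)
    return ssr, 2 * ssr * total_repeater_width


if __name__ == "__main__":
    # Hypothetical timing windows (arbitrary time units) for four repeaters.
    windows = [(0.0, 2.0), (1.0, 3.0), (2.5, 4.0), (5.0, 6.0)]
    ssr, width = lpm_width(windows, total_repeater_width=100.0)
    print(f"SSR = {ssr:.2f}, LPM width = {width:.1f}")
```

With the four hypothetical windows above, at most two repeaters switch at once, so SSR is 0.5 and the LPM width equals the total repeater width. At the typical SSR of about 10%, the same formula gives LPM devices about one fifth as wide as the repeaters they serve.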

Custom Repeater Design

We add a PMOS-NMOS pair to turn half of the devices off dynamically

Which power components are likely to reduce?
• Short-circuit power: during switching, PMOS & NMOS are ON momentarily → short circuit between VDD and VSS. High when transition time (slew) is large
• Subthreshold leakage: when one of the PMOS-NMOS pair between VDD and VSS is ON

Requirements:
• Low area overhead
    Added PMOS-NMOS pair (LPM devices) takes area
    LPM (low-power mode) signal has to be routed or locally generated
    Layout of the new cell must be simple and have low area overhead
• High performance when LPM signal is OFF
    On-resistance of LPM devices may reduce performance
• Good power-performance trade-off

On-Line Adjustable Buffering for Runtime Power Reduction
( http://vlsicad.ucsd.edu )

Puneet Sharma† ([email protected])
Advisor: Prof. Andrew B. Kahng‡†
Jointly with Mr. Sherief Reda‡

†Electrical & Computer Engineering
‡Computer Science & Engineering

Introduction

CMOS Power:
• Operational: dynamic and leakage
• Standby: leakage

Approaches to reduce operational power:
• Supply voltage (VDD) scaling
• Dynamic VDD and frequency scaling (DVFS)

DVFS used to provide dynamic power-performance tradeoff → Switch to low-power mode if high performance not needed
VDD already small to reduce dynamic power → Dynamic voltage scaling reduces noise margins → DVFS difficult to use due to reduced VDD

Our approach, like DVFS, provides dynamic low-power, low-performance modes → supplement or replace DVFS
Key idea: Many devices added for performance, not functionality → Turn those devices off when high performance not needed
Poor interconnect scaling → large number of repeaters
We modify repeaters to dynamically adjust their driving capacity

Experimental Validation

Experimental Setup
Circuits: s38417 (8,890 cells), AES (15,272), OpenRisc (46,732)
Tools: Synopsys HSPICE (SPICE), Design Compiler (synthesis, timing and power analysis), Cadence SoC Encounter (P&R), SignalStorm (library characterization), TSMC 90nm library models
Other settings: power and timing analysis at slow corner, VDD of 1.1V and 0.9V, activity factor of 0.01.

Power Reduction Results
• Cell-level results: when LPM signal is turned ON
    • 20-20% reduction in leakage
    • 15-30% reduction in short-circuit power (for same slew)
    • 45-65% increase in delay
• Circuit-level results:
    • Both dynamic and leakage power reduce
    • 6-12% reduction in total power at low performance modes

Area Overhead Estimation
• Area overhead due to LPM devices is 0.91% to 5.57%. May be smaller as LPM devices placeable in whitespace.
• Routing overhead: V’DD and V’SS nets routed as min. Steiner trees and found shorter than scanchain; LPM signal has short wirelength as #LPM devices is small.

Ensuring High Performance

Problem: Custom repeaters ~5% slower when LPM signal OFF → Up to ~5% reduction in circuit performance
Solution: use custom repeaters only on non-timing-critical paths
Additional constraint: slew constraints not violated when LPM signal is OFF or ON.
We characterize custom repeaters (i.e., find delay, slew, power, input capacitance) and then perform remapping with synthesis tool subject to delay and slew constraints. → No loss in circuit performance & no slew violations
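The idea of restricting custom repeaters to non-timing-critical paths can be illustrated with a simple slack check. This is only a sketch under simplifying assumptions and does not model the synthesis-tool remapping described above. It assumes each path is known with its repeater delays and timing slack, applies a uniform 5% slowdown to every repeater on the path, and ignores slew constraints and repeaters shared between paths. The `Path` type and `eligible_paths` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Path:
    name: str
    repeater_delays_ps: List[float]  # delay of each repeater on the path
    slack_ps: float                  # timing slack available on the path


def eligible_paths(paths, slowdown=0.05):
    """Return names of paths whose slack absorbs the custom-repeater slowdown."""
    chosen = []
    for path in paths:
        # Extra delay if every repeater on the path becomes ~5% slower.
        extra_delay = slowdown * sum(path.repeater_delays_ps)
        if extra_delay <= path.slack_ps:
            chosen.append(path.name)
    return chosen


if __name__ == "__main__":
    # Hypothetical paths: one with no slack, one with ample slack.
    paths = [
        Path("critical", [40.0, 40.0, 40.0], slack_ps=0.0),
        Path("relaxed", [50.0, 50.0], slack_ps=20.0),
    ]
    print(eligible_paths(paths))
```

In this sketch only the path with enough slack is remapped, so the critical path keeps its original repeaters and circuit performance with the LPM signal OFF is unaffected.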

Power-Performance Tradeoff

• Power-performance for circuit AES shown
• Utilize slack to reduce power when high performance not needed
• Power lowered or unchanged with LPM
• Alternatively, unchanged or higher performance given power budget
• Higher performance per watt

Poster panels: Introduction; Custom Repeater Design; Restricting Area Overhead; Ensuring High Performance; Power-Performance Tradeoff; Experimental Validation; Conclusions & Ongoing Work

[Figure: Traditional Inverter and Custom Inverter]
[Figure: LPM devices shared by two inverters]
[Figure: Power-performance w/ DVFS & DVFS combined w/ LPM. Panels: AES (445 to 337 MHz) and OpenRisc (192 to 149 MHz). x-axis: Frequency (MHz); y-axis: Total Power (mW); series: DVFS, DVFS+LPM]