Word-Size Optimization for Low Energy, Variable Workload Sub-threshold Systems
Sudhanshu Khanna, Anurag Nigam
ECE 632 – Fall 2008, University of Virginia
<sk4fs, an2z>@virginia.edu
Introduction
• Energy-constrained sub-Vt systems
  – Medical devices
  – Environmental sensors
• Need to lower E in order to enable "lifelong" operation
• Small form factor => area reduction
• Total E = Active E + Sleep E
Top Level Problems Addressed
• Energy Reduction
  – Active
  – Sleep Mode
• Area Reduction
• Adaptation of Super-threshold designs to sub-threshold
Current Approaches
Voltage is regulated by an off-chip (expensive) DC-DC converter
Ref: K.Craig, R.Matthews, EE632 Fall 2008
Our approach
Make the “starting point” design more E-efficient, Specifically for Sleep Mode operation
Sure way of lowering CV^2: lower V => sub-threshold
[Figure: the same Logic System shown at 1.2V and at 0.2V]
Can we optimize the logic system for sub-Vt operation, or should it be the same?
[Figure: Logic System at 1.2V vs. Smaller Logic System at 0.2V]
Make the system as small as feasible.
Use it over and over till the required operation is done.
Then go to sleep and leak less!
How do we make the system smaller? USE A SMALLER WORD-SIZE
Will using the SMALL system over and over increase the ACTIVE energy?
Smaller Word-Size: Problems Addressed
• For sure, small word-size means:
  – Lower Area
  – Lower Sleep Energy
  – Higher Delay
• We need to find:
  – How much is the Area/Sleep E benefit?
  – What is the impact of multi-cycle operation on Active E?
  – Can we somehow make them faster without losing the Sleep E and Area advantage?
Smaller Word-Size: Our Contribution
• How much is the Area/Sleep E benefit?
  – > 20x area benefit
  – > 33x sleep energy benefit
• What is the impact of multi-cycle operation on Active E?
  – Multi-cycle operation increases Active E
  – But the final value of the Active E is about the same as, or lower than, that of a 32-bit system
• Can we somehow make them faster without losing the Sleep E and Area advantage?
  – Yes, delay degradation can be overcome while still being more energy efficient
Systems Compared
• Addition of two 32-bit numbers using:
  – Large word-size (32-bit)
    • Kogge-Stone Adder
    • Ripple Carry Adder
    • Full-Adder
  – Small word-size (1-bit)
    • 1-bit taken for simplicity; the trends would be valid for other word-sizes, e.g. 16-bit, 8-bit, etc.
• Addition is taken as a sample digital function. However, the trends found can be generalized to other digital functions as well.
32-bit Kogge-Stone Adder (KSA), 32-bit Ripple Carry Adder (RCA)
[Figure: block diagram. Two 32-bit registers holding PA and PB feed a 32-bit KSA or RCA, whose 32-bit result goes to a 32-bit sum register (OUT). The registers are driven by CLK and Reset.]
PA = Parallel input A
PB = Parallel input B
OUT = Parallel output from Sum Register
Small-Word Size system
In general, an n-bit word system will have n-bit operands.
Let the smaller word-size be n, with n < 32. Then the system will look like this:
[Figure: an n-bit full adder with three n-bit registers, clocked by CLK]
Just like a 32-bit system, but only smaller!
In case n = 1, the system will take 32 clock cycles to add two 32-bit numbers. Hence the higher delay.
[Figure: the same system for n = 1, a 1-bit full adder with three 1-bit registers, clocked by CLK]
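The cycle count above can be illustrated with a small behavioral sketch. It is not the simulated circuit from the slides; the function name `add_with_small_word` and its parameters are hypothetical, and it only models an n-bit adder that is reused once per clock cycle with the carry passed between cycles.

```python
def add_with_small_word(a, b, operand_bits=32, word_bits=1):
    """Add two operand_bits-wide numbers on a word_bits-wide adder.

    Behavioral model only (an assumption, not the slides' netlist):
    one word_bits-wide slice is added per clock cycle, and the carry
    out of each slice is saved for the next cycle.
    Returns (sum modulo 2**operand_bits, cycles used).
    """
    mask = (1 << word_bits) - 1
    carry = 0
    result = 0
    cycles = 0
    for shift in range(0, operand_bits, word_bits):
        # Add the current slice of each operand plus the saved carry.
        chunk = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        result |= (chunk & mask) << shift
        carry = chunk >> word_bits
        cycles += 1
    return result & ((1 << operand_bits) - 1), cycles


if __name__ == "__main__":
    for n in (32, 8, 1):
        total, cycles = add_with_small_word(123456789, 987654321, 32, n)
        print(f"n = {n:2d}: sum = {total}, cycles = {cycles}")
```

With n = 32 the addition completes in one cycle, with n = 8 in four, and with n = 1 in 32, which is the delay penalty the slide refers to.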
1-bit Serial Adder (SA)
[Figure: Simulated 1-bit SA. A 1-bit full adder with 1-bit registers, clocked by CLK; each operand is a 1-bit input from another part of the chip.]
[Figure: A conceptual fully-serial 1-bit system. Blocks shown between the Analog Input and the Analog Output: Serial ADC, 1-bit Full Adder, Serial Multiplier, 1-bit registers, Serial DAC; clocked by CLK.]
32-bit Serial Adder (SA) using Full-Adder
[Figure: block diagram. Three 32-bit shift registers (parallel inputs PA and PB, output OUT) connect through 1-bit paths to a 1-bit full adder; a carry flip-flop sits between Cout and Cin. All are clocked by CLK.]
Regular 32-bit word system, but the parallel adder is replaced by a 1-bit full adder => LOWER SLEEP ENERGY
Takes 32 cycles, but is amenable to use in an unmodified 32-bit word system
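A minimal behavioral sketch of this serial adder follows. It assumes the usual bit-serial arrangement (LSB first, sum bits shifted into the output register from the MSB end, carry flip-flop cleared before the operation); the slides do not spell out these details, and the function name `serial_add_32` is hypothetical.

```python
WIDTH = 32
MASK = (1 << WIDTH) - 1


def serial_add_32(pa, pb):
    """Add two 32-bit numbers with one 1-bit full adder over 32 clock cycles."""
    reg_a = pa & MASK   # shift register loaded in parallel with PA
    reg_b = pb & MASK   # shift register loaded in parallel with PB
    reg_sum = 0         # output shift register (OUT)
    carry_ff = 0        # carry flip-flop, assumed reset to 0

    for _ in range(WIDTH):
        a_bit = reg_a & 1
        b_bit = reg_b & 1
        # 1-bit full adder: Cin comes from the carry flip-flop.
        sum_bit = a_bit ^ b_bit ^ carry_ff
        cout = (a_bit & b_bit) | (carry_ff & (a_bit ^ b_bit))
        # Clock edge: shift all registers and store Cout.
        reg_a >>= 1
        reg_b >>= 1
        reg_sum = (reg_sum >> 1) | (sum_bit << (WIDTH - 1))
        carry_ff = cout

    return reg_sum


if __name__ == "__main__":
    print(serial_add_32(5, 7))
```

The inputs and the output stay 32 bits wide, which is why this variant can drop into an unmodified 32-bit word system; only the adder itself shrinks to one bit.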
Important Metric: Energy per operation
• Energy drawn for addition of two 32-bit numbers is measured for all 4 systems:
  – 32-bit KSA
  – 32-bit RCA
  – 32-bit SA
  – 1-bit SA
• Clock and register power taken into account
Active Energy @ VDD = 300mV
[Figure: bar chart of E (pJ), split into Elkg and Edyn, for the small word-size system (1-bit SA) and the large word-size systems (32-bit KSA, 32-bit RCA, 32-bit SA), each at 90nm and at 22nm. Annotation: HIGH Edyn ~ Etot ~ 6pJ, but leakage current is 1.7x lower.]
Shows that active energy of 1-bit system < 32-bit systems
40% active energy benefit @ 22nm
33x reduction in leakage current (note that the plot above shows only active energy)
Conclusions @ 300mV
• 1-bit SA has 40% lower active E than the best 32-bit system
• 1-bit SA has 33x lower leakage current than the best 32-bit system
• 32-bit SA has 1.7x lower leakage current than 32-bit KSA
Thus multi-cycle operation doesn't increase active energy too much
Hence once sleep time is added, benefits of small-word systems will increase
=> if word-size is limited to 32, serial addition will save energy if the application has a lot of sleep time, e.g. in sensor nodes
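To see why the benefit grows with sleep time, here is a rough sketch using Total E = Active E + Sleep E. Only the two ratios (40% lower active energy, 33x lower leakage) come from the slides; the normalization to the best 32-bit system, the assumption that sleep energy is leakage power times sleep time, and all names are illustrative.

```python
def total_energy(active_energy, leakage_power, sleep_time):
    """Total E = Active E + Sleep E, with Sleep E = leakage power * sleep time."""
    return active_energy + leakage_power * sleep_time


# Normalized units (assumption): the best 32-bit system is the reference.
BASE_ACTIVE = 1.0
BASE_LEAK = 1.0
# Ratios quoted in the conclusions at 300mV.
SA1_ACTIVE = 0.6        # 40% lower active energy
SA1_LEAK = 1.0 / 33.0   # 33x lower leakage


def benefit(sleep_time):
    """How many times less total energy the 1-bit SA uses per operation."""
    base = total_energy(BASE_ACTIVE, BASE_LEAK, sleep_time)
    small = total_energy(SA1_ACTIVE, SA1_LEAK, sleep_time)
    return base / small


if __name__ == "__main__":
    for t in (0, 1, 10, 100):
        print(f"sleep time {t:3d} (normalized): benefit = {benefit(t):.2f}x")
```

In this model the benefit starts at the active-energy ratio when there is no sleep time and approaches the leakage ratio as sleep time dominates.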
• VDD increases => delay decreases
  – Can be used to make small-word size systems faster!
  – But what is the impact of the VDD increase on energy?
[Figure: Logic System at 1.2V, Logic System at 0.2V and small-word Logic System at 0.2V (already compared), and small-word Logic System at 0.4V]
Energy @ constant delay
• Delay is equal
• Now we compare energy at constant delay
[Figure: Logic System at 0.2V vs. small-word Logic System at 0.4V]
Small word-size is more energy efficient even after the VDD increase
But the margins of the energy benefits do go down
The same is not true in super-Vt! Why?
Difference in the on-current equation in super-Vt and sub-Vt
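The slides do not give the on-current equations, so the sketch below uses standard first-order textbook forms as an assumption: an exponential dependence on VDD in sub-Vt and an alpha-power law in super-Vt, with delay taken as proportional to C*VDD/Ion. All parameter values (Vt = 0.45 V, n = 1.5, alpha = 1.3) are illustrative, not taken from the simulated technologies.

```python
import math

THERMAL_VOLTAGE = 0.026  # kT/q in volts at roughly room temperature


def ion_sub_vt(vdd, vt=0.45, n=1.5, i0=1.0):
    """First-order sub-threshold on-current: exponential in (VDD - Vt)."""
    return i0 * math.exp((vdd - vt) / (n * THERMAL_VOLTAGE))


def ion_super_vt(vdd, vt=0.45, alpha=1.3, k=1.0):
    """Alpha-power-law on-current for super-threshold operation."""
    return k * (vdd - vt) ** alpha


def delay(vdd, ion):
    """Gate delay proportional to C * VDD / Ion (C normalized to 1)."""
    return vdd / ion


def speedup(v_low, v_high, ion_model):
    """Factor by which delay shrinks when VDD is raised from v_low to v_high."""
    return delay(v_low, ion_model(v_low)) / delay(v_high, ion_model(v_high))


if __name__ == "__main__":
    print("sub-Vt,   0.2V -> 0.4V:", round(speedup(0.2, 0.4, ion_sub_vt), 1))
    print("super-Vt, 0.6V -> 1.2V:", round(speedup(0.6, 1.2, ion_super_vt), 1))
```

With these illustrative parameters, doubling VDD in sub-Vt speeds the gate up by more than the 32x needed to hide a 32-cycle serial operation, while doubling VDD in super-Vt does not come close. That is the intuition behind the difference between the two regimes.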
[Figure: two log-scale plots against VDD (in V, 0 to 1.8): Delay (in ps) and Energy (in pJ). Each plot is split into a sub-Vt and a super-Vt region and is marked with a SMALL SLOPE part and a LARGE SLOPE part.]
VDD change => no impact on E !!
Pareto-Optimal E-D Curve
[Figure: log-log plot of Energy vs. Delay for the 32-bit KSA and the 1-bit SA, with the super-Vt and sub-Vt regions marked.]
Super-Vt -> 32-bit system is pareto-optimal
Sub-Vt -> 1-bit system is pareto-optimal
Cross-over: 1-bit system becoming optimal
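For readers unfamiliar with the term, a design point is pareto-optimal if no other point is at least as good in both delay and energy and strictly better in one. The sketch below shows one way to extract such points; the function name and the sample data are hypothetical and are not the measured values behind the plot.

```python
def pareto_front(points):
    """Return the (label, delay, energy) points that no other point dominates.

    A point is dominated if another point has delay and energy that are
    both no worse, with at least one of them strictly better.
    """
    front = []
    for label, d, e in points:
        dominated = any(
            (d2 <= d and e2 <= e) and (d2 < d or e2 < e)
            for _, d2, e2 in points
        )
        if not dominated:
            front.append((label, d, e))
    return sorted(front, key=lambda p: p[1])


# Made-up design points in arbitrary units, for illustration only.
SAMPLE_POINTS = [
    ("A", 1.0, 100.0),
    ("B", 10.0, 10.0),
    ("C", 10.0, 50.0),   # same delay as B but more energy
    ("D", 100.0, 1.0),
    ("E", 200.0, 5.0),   # slower than D and more energy
]

if __name__ == "__main__":
    for label, d, e in pareto_front(SAMPLE_POINTS):
        print(label, d, e)
```

Sweeping VDD for each system and feeding the resulting (delay, energy) pairs into such a filter would give a combined curve of the kind the slide describes.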
Generality of Trends
• 1-bit system is used as an example. Energy and area benefits will be achieved in any small word-size system.
• Shift in the pareto-optimal curve happens because of the difference in the Ion equation.
• Hence this behavior can be observed in other parts of a digital system as well, and not just addition.
Opens energy saving opportunities in more areas of digital design
Conclusions @ constant delay
• While going into sub-Vt operation, re-examine the word-size of the system being used.
• Optimal word-size goes down: a small word size gives lower E and area, and matches delay
[Figure: Logic System at 0.2V vs. small-word Logic System at 0.4V]
Small word-size system: Energy less, Leakage less, Area ($$$) less, Delay same
Different Word-Size Systems
• 1-bit (Digital Audio System – Sharp)
• 4-bit (Marc4 microcontroller, Intel 4040)
• 8-bit (Microcontrollers, Intel 8080 processor)
• 16-bit (Intel 8086 processor)
• 64-bit (Athlon 64, Opteron processor)
FIR Filter
• Used in many real-time DSP systems (audio, video processing)
4-Tap FIR Filter
$$Y(n) = \sum_{i=0}^{3} K(i)\,X(n-i)$$
K(i): Filter Coefficients
• Serial Implementation of a Parallel FIR filter
[Figure: parallel 4-tap FIR filter. X(n) passes through three Delay elements to give X(n-1), X(n-2), X(n-3); four Multipliers scale the taps by K0, K1, K2, K3; a 4-input Parallel Adder produces Y(n).]
K0, K1, K2, K3: Filter Coefficients, stored in memory
[Figure: serial implementation. X(n), the serial input data, and the filter coefficients (K3, K2, K1, K0) from memory feed a Serial Parallel Multiplier, followed by a 1-bit Serial Adder and a Register that produce the serial output Y(n).]
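The two structures compute the same 4-tap sum. The sketch below contrasts them at the word level only: the "serial" version reuses one multiplier and one accumulator over four steps, and it does not model the bit-serial datapath of the figure. The function names and the zero initial condition X(n) = 0 for n < 0 are assumptions made for illustration.

```python
def fir_parallel(x, k):
    """All taps in one step: Y(n) = sum over i of K(i) * X(n - i)."""
    y = []
    for n in range(len(x)):
        y.append(sum(k[i] * x[n - i] for i in range(len(k)) if n - i >= 0))
    return y


def fir_serial(x, k):
    """One multiply-accumulate per step, reusing a single multiplier and adder."""
    y = []
    for n in range(len(x)):
        acc = 0  # accumulator register, cleared for each output sample
        for i in range(len(k)):  # one coefficient fetched from memory per step
            if n - i >= 0:
                acc += k[i] * x[n - i]
        y.append(acc)
    return y


if __name__ == "__main__":
    samples = [1, 2, 3, 4, 5]
    coeffs = [1, 0, -1, 2]  # K0..K3, arbitrary example values
    print(fir_parallel(samples, coeffs))
    print(fir_serial(samples, coeffs))
```

Both functions return the same output sequence; the serial form trades four steps per sample for a quarter of the multiplier hardware, which mirrors the word-size trade-off discussed for the adders.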
QUESTIONS