EECS 470: Power and Architecture
Slides taken from Prof. David Brooks, Harvard University (2004), Modified by Mark Brehob & Thomas Wenisch.
Thanks to Prof. Brooks for kindly sending his slides!
Announcements
• HW 6 Posted, due 12/7
• Project due 12/10
• In-class presentations (~8 minutes + questions)
Outline
• Why is power a problem?
• What uses power in a chip?
• Relationship between power and performance.
• How can we reduce power?
Why is power a problem in a μP?
• Power used by the μP, vs. system power
• Dissipating heat
  • Melting (very bad)
  • Packaging (cooling costs $)
  • Heat leads to poorer performance.
• Providing power
  • Battery
  • Cost of electricity
Where does the juice go in laptops?
• Others have measured ~55% processor increase under max load in laptops [Hsu+Kremer, 2002]
Why worry about power dissipation?
• Battery life
• Thermal issues: affect cooling, packaging, reliability, timing
• Environment
Total Power Dissipation Trends
[Figure: power density (W/cm², log scale from 1 to 1000) vs. year (1980 to 2010) for the 386, 486, Pentium, Pentium Pro, Pentium 2, Pentium 3, Pentium 4, and Pentium 4 (Prescott), with the power densities of a hot plate and a nuclear reactor marked for comparison.]
Packaging cost
From Cray (local power generator and refrigeration)…
Source: Gordon Bell, “A Seymour Cray perspective”, http://www.research.microsoft.com/users/gbell/craytalk/
Packaging cost
To today…
• IBM S/390: refrigeration:
  • Provides performance (2% perf for 10ºC) and reliability
Source: R. R. Schmidt, B. D. Notohardjono, “High-end server low temperature cooling”, IBM Journal of R&D
Intel Itanium packaging
Complex and expensive (note heatpipe)
Source: H. Xie et al., “Packaging the Itanium Microprocessor”, Electronic Components and Technology Conference 2002
P4 packaging
• Simpler, but still…
[Figure: total power-related PC system cost ($, 0 to 40) vs. power (Watts, 0 to 40). From Tiwari, et al., DAC98.]
Source: Intel web site
The Battery Gap
Diverging Gap Between Actual Battery Capacities and Energy Needs
[Figure: battery capacity (mAh) vs. energy requirement (mAh), on a 0 to 5000 mAh scale over 2000 to 2007, as downlink rates grow from 10 kbps to 64 kbps, 384 kbps, and 2 Mbps. Application classes shown: PIM, SMS, voice; web browser, MMS, video clips (downlink dominated); video email, voice recognition, mobile commerce; mobile video-conferencing, collaboration (interactive). Battery technologies shown: lithium ion, lithium polymer, fuel cells.]
Source: Anand Raghunathan, NEC Labs
Power-Aware Computing Applications
[Figure: applications divided into energy-constrained computing and temperature/di-dt-constrained computing.]
Server Farms
• Internet data centers are like heavy-duty factories
  • e.g. small datacenter: 25,000 sq. feet, 8000 servers, 2 megawatts
  • Intergate Datacenter, Tukwila, WA: 1.5 mill. sq. ft, ~500 MW
• Wants lowest net cost per server per sq. foot of data center space
• Cost driven by:
  • Racking height
  • Cooling air flow
  • Power delivery
  • Maintenance ease (access, weight)
• 25% of total cost due to power
Environment
• Environmental Protection Agency (EPA): computers consume 10% of commercial electricity consumption
  • This incl. peripherals, possibly also manufacturing
  • A DOE report suggested this percentage is much lower (3.0-3.5%)
  • No consensus, but it’s still a lot
  • Interesting to look at the numbers:
    – http://enduse.lbl.gov/projects/infotech.html
• Data center growth was cited as a contribution to the 2000/2001 California Energy Crisis
• Equivalent power (with only 30% efficiency) for AC
• CFCs used for refrigeration
• Lap burn
• Fan noise
Power awareness needed across all computing platforms
• Mobile/portable (cell phones, laptops, PDA)
  • Battery life is critical
• Desktops/Set-Top (PCs and game machines)
  • Packaging cost is critical
• Servers (Mainframes and compute-farms)
  • Packaging limits
  • Volumetric (performance density)
Outline
• Why is power a problem?
• What uses power in a chip?
• Relationship between power and performance.
• How can we reduce power?
To tackle “What uses power in a chip?”, we need to learn a bit about circuits, power, and energy.
CMOS Water Analogy
Electron: water molecule
Charge: weight of water
Voltage: height
Current: flow rate
Capacitance: container cross-section
(Think of power plants that store energy in water towers)
Liquid Inverter
• Capacitance at input
  • Gates of NMOS, PMOS
  • Metal interconnect
• Capacitance at output
  • Fanout (# connections) to other gates
  • “Diffusion” capacitance of tx
  • Metal interconnect
NMOS conducts when water level is above switching threshold
PMOS conducts below
No conduction after container full
Delay and Energy Definitions
• Propagation Delay
  • Time to fill output container to 50%
  • Time to charge output capacitor to 50%
• Switching Energy
  • Weight * height of water moved
  • Charge * voltage of charge transferred
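
The two definitions above can be tried out numerically. The sketch below is illustrative only: it assumes a first-order RC model (a fixed drive resistance charging the load capacitor), which the slides do not state, and the function names and component values are hypothetical.

```python
import math

def switching_energy(c_load, vdd):
    """Charge * voltage of charge transferred: Q * V, with Q = C * V."""
    charge = c_load * vdd          # coulombs moved onto the output capacitor
    return charge * vdd            # joules drawn from the supply

def delay_to_50_percent(r_drive, c_load):
    """Time for an RC node to charge to 50% of its final voltage (assumed RC model)."""
    return r_drive * c_load * math.log(2)

c_load = 10e-15   # 10 fF load (hypothetical value)
vdd = 1.5         # volts
for r_drive in (10e3, 5e3):   # a stronger driver behaves like a smaller resistance
    print(r_drive, delay_to_50_percent(r_drive, c_load), switching_energy(c_load, vdd))
```

Under these assumptions, halving the drive resistance (more current) halves the delay, while the energy per switching event stays the same because it depends on the charge moved, not on the rate.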
Delay and Power Observations
• Load capacitance increases delay
  • High fanout (gates attached to output)
  • Interconnection
• Higher current can increase speed
  • Increasing transistor width raises currents but also raises capacitance
• Energy per switching event independent of current
  • Depends on amount of charge moved, not rate
Feedback-based Latch
• Pros:
  • Holds data as long as power applied
  • Actively drives output (can be fast)
• Con: Fairly big (5 transistors)
• Can be used for latches or SRAM cells
Charge-based Latch
• Pros:
  • Small: 1 transistor, 1 capacitor (may be gate of tx)
• Con:
  • Charge “leaks” off capacitor (~1ms)
  • Reads can be destructive (a read must be followed by a write)
• Can be used for latches or DRAM cells
Power: The Basics
• Dynamic power vs. Static power
  • Dynamic: “switching” power
  • Static: “leakage” power
  • Dynamic power dominates, but static power increasing in importance
  • Trends in each
• Static power: steady, per-cycle energy cost
• Dynamic power: capacitive and short-circuit
• Capacitive power: charging/discharging at transitions from 0→1 and 1→0
• Short-circuit power: power due to brief short-circuit current during transitions.
• Most research focuses on capacitive, but recent work on others
Dynamic (Capacitive) Power Dissipation
[Figure: CMOS inverter with input VIN and output VOUT; current I charges and discharges the load capacitance CL.]
• Data dependent: a function of switching activity
Capacitive Power dissipation
Power ~ ½ CV²Af
• Capacitance: Function of wire length, transistor size
• Supply Voltage: Has been dropping with successive fab generations
• Clock frequency: Increasing…
• Activity factor: How often, on average, do wires switch?
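
As a quick numerical check of the ½ CV²Af relation, the sketch below plugs in made-up values. The function name and all numbers are hypothetical and only meant to show how the four terms combine.

```python
def dynamic_power(c_switched, vdd, activity, freq_hz):
    """Capacitive power estimate: 1/2 * C * V^2 * A * f, in watts."""
    return 0.5 * c_switched * vdd ** 2 * activity * freq_hz

# Hypothetical chip: 20 nF of total capacitance, half of it switching each cycle.
base = dynamic_power(c_switched=20e-9, vdd=1.5, activity=0.5, freq_hz=1e9)
low_v = dynamic_power(c_switched=20e-9, vdd=0.75, activity=0.5, freq_hz=1e9)
print(base, low_v, low_v / base)
```

Halving the supply voltage with everything else fixed cuts the estimate to one quarter, which is the quadratic dependence on V.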
Lowering Dynamic Power
• Reducing Vdd has a quadratic effect
  • Has a negative (~linear) effect on performance, however
• Lowering CL
  • May improve performance as well
  • Keep transistors small (keeps intrinsic capacitance (gate and diffusion) small)
• Reduce switching activity
  • A function of signal transition stats and clock rate
  • Clock gating idle units
  • Impacted by logic and architecture decisions
Short-Circuit Power Dissipation
[Figure: CMOS inverter with input VIN, output VOUT, and load capacitance CL; short-circuit current ISC flows from VDD to GND.]
• Short-circuit current caused by finite-slope input signals
• Direct current path between VDD and GND when both NMOS and PMOS transistors are conducting
Short-Circuit Power Dissipation
Power_SC ~ t_sc · V · I_peak
• Power determined by
  • Duration and slope of input signal, t_sc
  • I_peak determined by transistor sizes, process technology, CL
• Short circuit power can be minimized
  • Try to match rise/fall times of input and output signals
  • Have not seen many architectural solutions here
  • Good news: relatively, Power_SC is shrinking
Leakage Currents
[Figure: CMOS inverter with input VIN, output VOUT, and load capacitance CL, showing the subthreshold current ISub and the gate current Igate.]
$$I_{DSub} = k \cdot e^{\frac{-q \cdot V_T}{a \cdot k_a \cdot T}}$$
• Subthreshold currents grow exponentially with increases in temperature, decreases in threshold voltage
  • But threshold voltage scaling is key to circuit performance!
• Gate leakage primarily dependent on gate oxide thickness, biases
• Both types of leakage heavily dependent on stacking and input pattern
• More on leakage later in the semester
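
To get a feel for the exponential dependence, here is a minimal sketch of the subthreshold-current expression. It assumes k_a is Boltzmann's constant and picks a = 1.5 and k = 1 purely for illustration; the slides give no values for these, so only the ratios between results are meaningful.

```python
import math

Q = 1.602176634e-19     # electron charge, coulombs
K_B = 1.380649e-23      # Boltzmann constant, J/K (assumed meaning of k_a)

def subthreshold_current(k, v_t, temp_kelvin, a=1.5):
    """I_DSub = k * exp(-q * V_T / (a * k_a * T)); k and a are assumed constants."""
    return k * math.exp(-Q * v_t / (a * K_B * temp_kelvin))

nominal = subthreshold_current(k=1.0, v_t=0.4, temp_kelvin=300.0)
lower_vt = subthreshold_current(k=1.0, v_t=0.3, temp_kelvin=300.0)
hotter = subthreshold_current(k=1.0, v_t=0.4, temp_kelvin=350.0)
print(lower_vt / nominal, hotter / nominal)
```

Both ratios come out greater than one: lowering the threshold voltage or raising the temperature increases the leakage estimate exponentially.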
Gate vs. Subthreshold Leakage Trends
[Figure: gate vs. subthreshold leakage trends. From Mukhopadhyay, et al., TVLSI ’03.]
Lowering Static Power
• Design-time Decisions
  • Use fewer, smaller transistors; stack when possible to minimize contacts with Vdd/Gnd
  • Multithreshold process technology (multiple oxides too!)
    – Use “high-Vt” slow transistors whenever possible
• Dynamic Techniques
  • Reverse-Body Bias (dynamically adjust threshold)
    – Low-leakage sleep mode (maintain state), e.g. XScale
  • Vdd-gating (Cut voltage/gnd connection to circuits)
    – Zero-leakage sleep mode
    – Lose state, overheads to enable/disable
What do we mean by Power?
• Max Power: Artificial code generating max CPU activity
• Worst-case App Trace: Practical applications worst-case
• Thermal Power: Running average of worst-case app power over a time period corresponding to thermal time constant
• Average Power: Long-term average of typical apps (minutes)
• Transient Power: Variability in power consumption for supply net
Power vs. Energy
• Power consumption in Watts
  • Determines battery life in hours
  • Sets packaging limits
• Energy efficiency in joules
  • Power is the rate at which energy is consumed over time
  • Energy = power * delay (joules = watts * seconds)
  • Lower energy number means less power to perform a computation at same frequency
Power vs. Energy
• Power-delay Product (PDP) = Pavg * t
  • PDP is the average energy consumed per switching event
• Energy-delay Product (EDP) = PDP * t
  • Takes into account that one can trade increased delay for lower energy/operation
• Energy-delay² Product (EDDP, or ED²P) = EDP * t
• Why do we need so many formulas?!!?
  • We want a voltage-invariant efficiency metric! Why?
  • Power ~ ½ CV²Af, Performance ~ f (and V)
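
The three products are each the previous one multiplied by the delay again. A minimal sketch, with hypothetical function names and arbitrary input values:

```python
def pdp(p_avg, t):
    """Power-delay product: average power * delay, i.e. energy in joules."""
    return p_avg * t

def edp(p_avg, t):
    """Energy-delay product: PDP * delay."""
    return pdp(p_avg, t) * t

def ed2p(p_avg, t):
    """Energy-delay^2 product: EDP * delay."""
    return edp(p_avg, t) * t

# Arbitrary example: 2 W average power, 3 s delay.
print(pdp(2.0, 3.0), edp(2.0, 3.0), ed2p(2.0, 3.0))
```

Each extra factor of t weights delay more heavily, so a design that saves energy by running slower looks progressively worse under EDP and ED²P.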
E vs. EDP vs. ED²P
• Power ~ CV²f ~ V³ (fixed microarch/design)
• Performance ~ f ~ V (fixed microarch/design)
  • (For the nominal voltage range, f varies approx. linearly with V)
• Comparing processors that can only use freq/voltage scaling as the primary method of power control:
  • (perf)³ / power, or MIPS³/W or SPEC³/W, is a fair metric to compare energy efficiencies.
  • This is an ED²P metric. We could also use: (CPI)³ * W for a given application
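
Why the cubed metric is the voltage-invariant one can be checked directly. The sketch below applies the scaling rules stated on this slide (power ~ V³, performance ~ V) to a hypothetical design; the function names and the 80 W / 1000 MIPS starting point are illustrative choices.

```python
def scale_voltage(power, perf, s):
    """Scale voltage (and frequency) by factor s: power ~ V^3, performance ~ V."""
    return power * s ** 3, perf * s

def metrics(power, perf):
    """Candidate efficiency metrics; higher is better for each."""
    return {
        "perf/W": perf / power,
        "perf^2/W": perf ** 2 / power,
        "perf^3/W": perf ** 3 / power,
    }

before = metrics(80.0, 1000.0)
after = metrics(*scale_voltage(80.0, 1000.0, 0.9))
for name in before:
    print(name, after[name] / before[name])
```

Only the perf³/W ratio stays at 1 (up to rounding). The other two improve simply because the voltage was lowered, so they would reward voltage scaling rather than a better design.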
E vs. EDP vs. ED²P
• Currently have a processor design:
  • 80W, 1 BIPS, 1.5V, 1GHz
  • Want to reduce power, willing to lose some performance
• Cache Optimization:
  – IPC decreases by 10%, reduces power by 20% => Final Processor: 900 MIPS, 64W
  – Relative E = MIPS/W (higher is better) = 14.06/12.5 = 1.125x
• Energy is better, but is this a “better” processor?
Not necessarily
• 80W, 1 BIPS, 1.5V, 1GHz
  • Cache Optimization:
    – IPC decreases by 10%, reduces power by 20% => Final Processor: 900 MIPS, 64W
    – Relative E = MIPS/W (higher is better) = 14.06/12.5 = 1.125x
    – Relative EDP = MIPS²/W = 1.01x
    – Relative ED²P = MIPS³/W = .911x
• What if we just adjust frequency/voltage on processor?
  • How to reduce power by 20%?
  • P = CV²F = CV³ => Drop voltage by 7% (and also Freq) => .93*.93*.93 = .8x
  • So for equal power (64W)
    – Cache Optimization = 900MIPS
    – Simple Voltage/Frequency Scaling = 930MIPS
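
The arithmetic in this example can be reproduced in a few lines. The sketch below uses the slide's numbers (80 W and 1000 MIPS baseline, 64 W and 900 MIPS after the cache optimization); the function name is a hypothetical helper.

```python
def relative_metric(base_mips, base_watts, new_mips, new_watts, n):
    """Ratio of MIPS^n / W for the new design over the baseline (higher is better)."""
    return (new_mips ** n / new_watts) / (base_mips ** n / base_watts)

for n, name in ((1, "E"), (2, "EDP"), (3, "ED2P")):
    print(name, relative_metric(1000, 80, 900, 64, n))

# Voltage/frequency scaling that reaches the same 20% power reduction,
# assuming power ~ V^3 and performance ~ V as on the slide.
scale = 0.8 ** (1 / 3)
print(scale, 1000 * scale)
```

The exact cube root gives a voltage drop of about 7.2% and roughly 928 MIPS, which the slide rounds to 7% and 930 MIPS. Either way, plain voltage/frequency scaling beats the 900 MIPS of the cache optimization at equal power.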
Power vs. SPECint2K Performance
[Figure: power dissipation (W, 0 to 140) vs. SPECINT2000 score (0 to 2000) for AMD Athlon, AMD Opteron, Pentium 3, Pentium 4, Apple PowerPC, Itanium, IBM PowerPC, and UltraSPARC III.]
Analysis Abstraction Levels
Abstraction Level      Analysis Capacity  Analysis Accuracy  Analysis Speed  Analysis Resources  Energy Savings
Application            Most               Worst              Fastest         Least               Most
Behavioral
Architectural (RTL)
Logic (Gate)
Transistor (Circuit)   Least              Best               Slowest         Most                Least
Modeling Hierarchy and Tool Flow
[Figure: tool flow starting from a set of workloads and descending through successively more detailed models, each with its own simulation, test cases, and edit/debug/refine loop:]
• Early analytical performance models
• Microarch-level: trace/exec-driven, cycle-accurate simulation models (energy models, performance test cases, microarch parms/specs)
• RTL-level: RTL model (VHDL), RTL sim, architectural test cases
• Gate-level: gate-level model (if synthesized), bitvector test cases
• Ckt-level: circuit-level (hierarchical) netlist model, ckt extract, sim, test cases, edit/tune/debug
• Layout-level: layout-level physical design model, cap extract, sim, design rules, design rule check, validate
Power/Performance abstractions
• Low-level:
  • Hspice
  • PowerMill
• Medium-Level:
  • RTL Models
• Architecture-level:
  • PennState SimplePower
  • Intel Tempest
  • Princeton Wattch
  • IBM PowerTimer
  • Umich/Colorado PowerAnalyzer
Low-level models: Hspice
• Extracted netlists from circuit/layout descriptions
  • Diffusion, gate, and wiring capacitance is modeled
• Analog simulation performed
  • Detailed device models used
  • Large systems of equations are solved
  • Can estimate dynamic and leakage power dissipation within a few percent
  • Slow, only practical for 10-100K transistors
• PowerMill (Synopsys) is similar but about 10x faster
Medium-level models: RTL
• Logic simulation obtains switching events for every signal
  • Structural VHDL or Verilog with zero or unit-delay timing models
• Capacitance estimates performed
  • Device Capacitance
    – Gate sizing estimates performed, similar to synthesis
  • Wiring Capacitance
    – Wire load estimates performed, similar to placement and routing
• Switching event and capacitance estimates provide dynamic power estimates
Architecture level models
• Two major classes:
  • Cycle/Event-Based: Arch. level power models interfaced with cycle-driven performance simulation
  • Instruction-Based: Measurement/characterization based on instruction usage and interactions
• Components of Arch. level power model
  • Could be based on ckt schematic measurements/extrapolation
  Or…
  • Capacitance models
  Both may need to consider…
  • Circuit design styles
  • Clock gating styles & unit usage statistics
  • Signal transition statistics
Architecture level models
Power ~ ½ CV²Af
• Analytical Approach:
  • Estimate “CV²f” via analytical models
  • Tools: Wattch, PowerAnalyzer, Tempest (mixed-mode)
• Empirical Approach
  • Estimate “CV²f” via empirical measurements
  • Tools: PowerTimer, AccuPower, Internal Industrial Tools
• Estimate “A” via statistics from architectural-performance simulators
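
One way to picture how the pieces fit together: each unit gets a "CV²f" term from a capacitance model, and the activity "A" comes from access counts reported by a performance simulator. The sketch below is a simplified illustration under those assumptions, not the interface of Wattch or any other tool named above; the unit names, capacitances, and counts are invented.

```python
def estimate_unit_power(unit_caps, access_counts, total_cycles, vdd, freq_hz):
    """Per-unit power estimate: 1/2 * C * V^2 * A * f, with A = accesses / cycles."""
    power = {}
    for unit, cap in unit_caps.items():
        activity = access_counts.get(unit, 0) / total_cycles
        power[unit] = 0.5 * cap * vdd ** 2 * activity * freq_hz
    return power

# Hypothetical effective capacitances (farads) and simulator access counts.
unit_caps = {"regfile": 100e-12, "icache": 400e-12, "alu": 50e-12}
access_counts = {"regfile": 500, "icache": 800, "alu": 300}

per_unit = estimate_unit_power(unit_caps, access_counts, total_cycles=1000,
                               vdd=1.5, freq_hz=1e9)
print(per_unit, sum(per_unit.values()))
```

A unit that is never accessed contributes nothing in this sketch, which corresponds to ideal clock gating; a real model would also charge idle units for clock and leakage power.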
Analytical Modeling Tools: Modeling Capacitance
• Requires modeling wire length and estimating transistor sizes
• Related to RC delay analysis for speed along critical path
  • But capacitance estimates require summing up all wire lengths, rather than only an accurate estimate of the longest one.
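Register File: Capacitance Analysis
[Figure: register file array. Decoders drive the wordlines (one per entry, "Number of Entries"); bitlines run across the data width of the entries, with pre-charge at one end and sense amps at the other; each cell connects to its bitlines through cell access transistors (N1). The number of wordlines and bitlines also scales with the number of ports.]
$$C_{wordline} = C_{diffcap}(WordlineDriver) + C_{gatecap}(N1) \cdot NumberBitlines + Wordlinelength \cdot C_{metal}$$
$$C_{bitline} = C_{diffcap}(Pch) + C_{diffcap}(N1) \cdot NumberWordlines + Bitlinelength \cdot C_{metal}$$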
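
The two capacitance expressions translate directly into code. The sketch below is an illustration with hypothetical parameter names and unit-free example numbers; real values would come from a process technology file.

```python
def wordline_capacitance(c_diff_driver, c_gate_n1, num_bitlines,
                         wordline_length, c_metal_per_length):
    """Driver diffusion cap + one access-transistor gate per bitline + wire cap."""
    return (c_diff_driver
            + c_gate_n1 * num_bitlines
            + wordline_length * c_metal_per_length)

def bitline_capacitance(c_diff_pch, c_diff_n1, num_wordlines,
                        bitline_length, c_metal_per_length):
    """Precharge diffusion cap + one access-transistor diffusion per wordline + wire cap."""
    return (c_diff_pch
            + c_diff_n1 * num_wordlines
            + bitline_length * c_metal_per_length)

# Arbitrary illustrative numbers (same units throughout).
print(wordline_capacitance(1.0, 2.0, 3, 4.0, 0.5))
print(bitline_capacitance(1.0, 0.5, 4, 10.0, 0.1))
```

Adding ports increases both the number of bitlines and wordlines and the wire lengths, so every term except the driver and precharge diffusion grows with port count.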
Register File Model: Validation
Error Rates   Gate     Diff     InterConn.  Total
Wordline(r)   1.11     0.79     15.06       8.02
Wordline(w)   -6.37    0.79     -10.68      -7.99
Bitline(r)    2.82     -10.58   -19.59      -10.91
Bitline(w)    -10.96   -10.60   7.98        -5.96
(Numbers in Percent)
• Validated against a register file schematic used in Intel’s Merced design
• Compared capacitance values with estimates from a layout-level Intel tool
• Interconnect capacitance had largest errors
  • Model currently neglects poly connections
  • Differences in wire lengths: difficult to tell wire distances of schematic nodes
Architecture level models: Signal Transition Statistics
• Dynamic power is proportional to switching
• How to collect signal transition statistics in architectural-level simulation?
  • Many signals are available, but do we want to use all of them?
  • One solution (register file):
    – Collect statistics on the important ones (bitlines)
    – Infer where possible (wordlines)
    – Assign probabilities for less important ones (decoders)
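
Collecting transition statistics for a bus such as the bitlines can be as simple as counting bit flips between successive values. The sketch below is one possible way to do that, with hypothetical function names; it is not taken from any of the tools mentioned in these slides.

```python
def count_toggles(prev, curr, width):
    """Number of bits that differ between two successive values on a bus."""
    mask = (1 << width) - 1
    return bin((prev ^ curr) & mask).count("1")

def activity_factor(values, width):
    """Average fraction of wires that switch per transfer over a value trace."""
    if len(values) < 2:
        return 0.0
    toggles = sum(count_toggles(a, b, width) for a, b in zip(values, values[1:]))
    return toggles / (width * (len(values) - 1))

trace = [0b0000, 0b1111, 0b1111, 0b1010]   # made-up sequence of 4-bit values
print(activity_factor(trace, width=4))
```

The resulting fraction can serve as the "A" term for that bus; signals that are not traced this way would need an inferred or assigned probability instead.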
Architecture level models: Clock Gating: What, why, when?
[Figure: a gate signal is combined with the clock to produce a gated clock.]
• Dynamic power is dissipated on clock transitions
• Gating off clock lines when they are unneeded reduces activity factor
• But putting extra gate delays into clock lines increases clock skew
• End results:
  • Clock gating complicates design analysis but saves power.