Integrating Adaptive On-Chip Storage Structures for Reduced Dynamic Power
1
Integrating Adaptive On-Chip Storage Structures for Reduced
Dynamic Power
Steve Dropsho,
Alper Buyuktosunoglu, Rajeev Balasubramonian,
David H. Albonesi, Sandhya Dwarkadas,
Greg Semeraro, Grigorios Magklis, and Michael Scott
ECE and CS Departments
University of Rochester
2
Why Adaptive Structures?
• General-purpose microprocessors (µP) are “one size fits all”
• But needs vary across (and within) applications
• Can save considerable energy by matching resources to the application
Objective: Less energy for the same performance by adapting storage structures to the application
3
Related Work
• Adaptable cache
– Balasubramonian et al., MICRO 2000
– Dhodapkar and Smith, ISCA 2002
• Adaptable issue logic
– Buyuktosunoglu et al., GLS VLSI 2001
– Folegnani and Gonzalez, ISCA 2000
4
Common Themes
• A single adaptive structure
• Use of global information for feedback
• Exploration-based (caches)
5
Related Work (cont)
• Adaptable IQ, LSQ, and ROB
– Ponomarev et al., MICRO 2001
– Three (3) adaptable structures
– Reconfigurations based on local state
6
Integrating Multiple Adaptive Structures
[Figure: processor block diagram. Labeled blocks: L2 Unified Cache, L1 Icache, L1 Dcache, Branch predict, FetchQ, Rename map, ROB, IIQ, FPQ, LSQ, IPREG, FPREG, Int FUs, FP FUs. Labeled sections: Integer, Memory, Floating Pt.]
7
Challenges
• Multiple (9) adaptive structures create a state explosion problem
• Use of global information makes assigning cause and effect difficult
• Potential for additive performance effects among the structures
8
Approach: Local Management
• Local information for configuration decisions
• Tight control over performance variance
9
Part I: The Caches
[Figure: processor block diagram, same as on slide 6.]
10
The Accounting Cache
A access (primary)
B access (secondary)
• Sequential accesses, A then B
• Save energy on A access hit
• Swap blocks on A access miss
[Figure: a 4-way cache (ways 0 to 3) split into A and B partitions, with a swap between A and B. Configurations shown: A1/B3, A2/B2, A3/B1, A4/B0.]
11
Most-Recently-Used Statistics
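[Figure: MRU state transitions for one set holding lines A, B, C, D in ways 1 to 4, with MRU positions 0 to 3. The MRU state counters in the example read MRU[0] = 3, MRU[1] = 2, MRU[2] = 1, MRU[3] = 0, Misses = 0. The line labels shown next to the counters are A, A, A, B, B, C.]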
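The counters on this slide can be illustrated with a small sketch. It assumes a single cache set kept in recency order, where each hit increments the counter for the MRU position at which the line was found. The function name `mru_statistics` and the list-based bookkeeping are hypothetical, chosen for illustration only; they are not taken from the slides.

```python
def mru_statistics(accesses, ways=4):
    """Count hits per MRU position for one cache set (illustrative helper)."""
    order = []                # order[0] is the most recently used line
    counters = [0] * ways     # counters[i] = number of hits at MRU position i
    misses = 0
    for line in accesses:
        if line in order:
            # Hit: record the MRU position where the line was found.
            counters[order.index(line)] += 1
            order.remove(line)
        else:
            # Miss: make room by evicting the least recently used line.
            misses += 1
            if len(order) == ways:
                order.pop()
        order.insert(0, line)  # the accessed line becomes most recently used
    return counters, misses
```

Because the counters record where each hit falls in recency order, a single pass over the accesses is enough to tell how many hits any smaller A partition would have captured.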
12
Configuration Evaluation
MRU counters, from most recently used (mru) to least recently used (lru):
MRU[0] = 3, MRU[1] = 2, MRU[2] = 1, MRU[3] = 0, Misses = 0

Configuration    Delay            Energy
A1/B3            6 DA + 3 DB      6 E1 + 3 E3
A2/B2            6 DA + 1 DB      7 E2
A3/B1            6 DA             6 E3
A4/B0 (BASE)     6 DA             6 E4
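The delay and energy expressions on this slide follow from the MRU counters, and the sketch below reproduces the arithmetic. It assumes that every access probes A, that any access not hitting in A goes on to B when a B partition exists, and that misses also probe B. The slide's example has zero misses, so that last point is an assumption. The function name and the parameters `d_a`, `d_b`, and `e_way` are hypothetical placeholders for the slide's DA, DB, and E1 to E4.

```python
def evaluate_configs(mru_counters, misses, d_a, d_b, e_way):
    """Return {a_ways: (delay, energy)} for every A/B split of one cache.

    e_way[k] is the assumed energy of one access to a k-way partition,
    with e_way[0] == 0 for the case where there is no B partition.
    """
    ways = len(mru_counters)
    total = sum(mru_counters) + misses        # every access probes A first
    results = {}
    for a_ways in range(1, ways + 1):
        b_ways = ways - a_ways
        a_hits = sum(mru_counters[:a_ways])   # hits within the A partition
        # Accesses that miss in A continue to B, if a B partition exists.
        b_accesses = total - a_hits if b_ways > 0 else 0
        delay = total * d_a + b_accesses * d_b
        energy = total * e_way[a_ways] + b_accesses * e_way[b_ways]
        results[a_ways] = (delay, energy)
    return results
```

With the slide's counters (3, 2, 1, 0 and no misses), the A1/B3 split gives 6 A accesses and 3 B accesses, and the A2/B2 split gives 6 A accesses and 1 B access, matching 6 DA + 3 DB and 7 E2.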
13
Tolerance and the Bank Account
• Tolerance allows more delay than BASE
– DTOL = DBASE (1 + TOL)
– TOL = {0.015, 0.062, 0.25} (1/64, 1/16, 1/4)
• Bank account allows accumulation of unused tolerance
• Use account credits in later intervals
– Allows aggressive resizing
– Amortizes mistakes over many intervals
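One way to read the tolerance and bank-account bullets is sketched below. It assumes that each interval picks the lowest-energy configuration whose delay fits within DTOL plus the current account balance, and that unused delay allowance is carried forward as credit. The exact selection and update rules are not given on the slide, so the function `choose_config` and its bookkeeping are assumptions made for illustration.

```python
def choose_config(configs, base, tol, account):
    """Pick a configuration for one interval and update the account.

    configs maps a configuration name to (delay, energy); base names the
    BASE configuration. Returns (chosen_name, new_account).
    """
    allowed = configs[base][0] * (1 + tol)    # D_TOL = D_BASE * (1 + TOL)
    budget = allowed + account                # credits extend the allowance
    feasible = {name: de for name, de in configs.items() if de[0] <= budget}
    # Among configurations within budget, take the one with the least energy.
    chosen = min(feasible, key=lambda name: feasible[name][1])
    # Unused allowance is banked; an expensive choice spends credits.
    new_account = budget - configs[chosen][0]
    return chosen, new_account
```

In this sketch the BASE configuration is always feasible, so the account never goes negative; a conservative choice in one interval leaves credits that permit a more aggressive size later.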
14
Memory Hierarchy
[Figure: one possible configuration of the memory hierarchy, each cache drawn with ways 0 to 3: L1 I-Cache (A/B), L1 D-Cache (A, no B), L2 Unified Cache (A/B).]
15
Environment
• SimpleScalar simulator
• Microarchitecture is similar to Alpha 21264
• Benchmarks are a mix of SPEC95, SPEC2K, and Olden
• Energy models for buffers and caches from Buyuktosunoglu et al., GLS VLSI 2001 and Balasubramonian et al., MICRO 2000
16
Cache Results
17
Part II: Queues, Regs, and ROB
[Figure: processor block diagram, same as on slide 6.]
18
Resizable Queues/Reg File
[Figure: a buffer divided into partitions P1 through PN, each holding m elements.]
N partitions of m elements
19
Buffer Sizing With Limited Histogramming
[Figure: distribution of buffer size, on an axis from 0 to Full, for three cases: grow buffer, proper size, and precise shrink. The average (ave) is marked.]
• 8K cycle period
• Tolerances:
– 1.5% (1/64)
– 6.2% (1/16)
– 25.0% (1/4)
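The slide does not spell out the sizing rule, so the following is only one plausible reading of histogram-based shrinking, not the authors' algorithm. It assumes occupancy is sampled over a period, binned by the number of m-element partitions needed, and that the buffer shrinks to the smallest partition count for which the fraction of samples needing more stays within the tolerance. The function name and parameters are hypothetical.

```python
def choose_partitions(occupancy_samples, n_partitions, m, tol):
    """Pick the smallest number of active partitions within the tolerance."""
    # hist[k] = number of samples in which exactly k partitions were needed.
    hist = [0] * (n_partitions + 1)
    for occ in occupancy_samples:
        needed = -(-occ // m)                  # ceil(occ / m) for integers
        hist[min(needed, n_partitions)] += 1
    allowed = tol * len(occupancy_samples)     # samples allowed to exceed the size
    for size in range(1, n_partitions + 1):
        exceeding = sum(hist[size + 1:])       # samples that needed more than size
        if exceeding <= allowed:
            return size
    return n_partitions
```

This sketch covers only shrinking. A full buffer hides how much more space the program wanted, which is presumably why the slide treats growing as a separate case.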
20
Resizing the Register File
• Issue: Do not know when registers expire
• Solution: To make the register file smaller, move values out of the partition (P) to be turned off
– First, inhibit new assignments to P
– Next, use a software interrupt routine to move values via normal rename logic: mov r1 r1
– Register mappings are automatically updated
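The two steps above can be sketched with a toy rename map. The sketch assumes a dictionary from architectural to physical registers and a simple free list, and it ignores in-flight instructions and the timing of when old physical registers are released. The function `drain_partition` and its data structures are hypothetical simplifications, not a description of the actual hardware.

```python
def drain_partition(rename_map, free_list, partition):
    """Move every mapped value out of `partition`, a set of physical registers.

    Returns the updated rename map and the remaining free list.
    """
    # Step 1: inhibit new assignments to P by withholding its free registers.
    free = [p for p in free_list if p not in partition]
    new_map = dict(rename_map)
    for arch, phys in rename_map.items():
        if phys in partition:
            if not free:
                raise RuntimeError("no free register outside the partition")
            # Step 2: "mov arch arch" gives the destination a fresh physical
            # register outside P, so the mapping is updated by normal renaming.
            new_map[arch] = free.pop(0)
    return new_map, free
```

After the routine runs, no architectural register maps into P, so the partition holds no live values and can be turned off.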
21
Floating Point App Results
22
Summary Results
23
Conclusion
• Simultaneous adaptation of all major regular structures
– Accounting cache
– Limited histogramming for buffers
– Adaptable register file
• Local control yet tolerable performance loss
• Future work
– Augment local control with global control for bounded performance loss