University of Michigan, Electrical Engineering and Computer Science

Composite Cores: Pushing Heterogeneity into a Core

Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Faissal M. Sleiman, Ronald Dreslinski, Thomas F. Wenisch, and Scott Mahlke
University of Michigan
MICRO 45, May 8th, 2012
Slide 2: High Performance Cores
High performance cores waste energy on low performance phases.

[Plot: performance and energy over time.]

High energy yields high performance, but low performance DOES NOT yield low energy.
Slide 3: Core Energy Comparison
[Chart: core energy comparison, Out-of-Order vs. In-Order (Brooks, ISCA '00; Dally, IEEE Computer '08).]

• Out-of-Order cores contain performance-enhancing hardware
• Not necessary for correctness

Do we always need the extra hardware?
Slide 4: Previous Solution: Heterogeneous Multicore
• 2+ cores; same ISA, different implementations
  + High performance, but more energy
  – Energy efficient, but less performance
• Share memory at a high level
  – Shared L2 cache (Kumar '04)
  – Coherent L2 caches (ARM's big.LITTLE)
• The operating system (or programmer) maps the application to the smallest core that provides the needed performance
Slide 5: Current System Limitations
• Migration between cores incurs high overheads
  – ~20K cycles (ARM's big.LITTLE)
• Sample-based schedulers
  – Sample the different cores' performances, then decide whether to reassign the application
  – Assume stable performance within a phase
• A phase must be long to be recognized and exploited
  – 100M–500M instructions in length

Do finer-grained phases exist? Can we exploit them?
Slide 6: Performance Change in GCC
• Average IPC over a 1M-instruction window (quantum)
• Average IPC over 2K quanta
[Two plots: instructions/cycle (0–3) vs. instructions (100K–1M) for the Big Core and the Little Core.]
Slide 7: Finer Quantum
• 20K-instruction window from GCC
• Average IPC over 100-instruction quanta

What if we could map these to a Little Core?
[Plot: instructions/cycle (0–3) vs. instructions (160K–180K) for the Big Core and the Little Core.]
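The quantum-averaged IPC curves above can be sketched as a simple windowing computation. The trace and IPC values below are toy data, not GCC measurements:

```python
# Sketch: quantum-averaged IPC from a per-instruction cycle trace.
# `cycles_per_instr` is a hypothetical trace (cycles spent on each
# retired instruction); real data would come from a simulator.

def quantum_ipc(cycles_per_instr, quantum=100):
    """Average IPC over fixed-size instruction quanta."""
    ipcs = []
    for start in range(0, len(cycles_per_instr), quantum):
        window = cycles_per_instr[start:start + quantum]
        cycles = sum(window)
        ipcs.append(len(window) / cycles if cycles else 0.0)
    return ipcs

# Toy example: a high-IPC phase followed by a memory-bound phase.
trace = [0.5] * 200 + [4.0] * 200      # cycles per instruction
print(quantum_ipc(trace, quantum=100))  # → [2.0, 2.0, 0.25, 0.25]
```

Shrinking `quantum` is exactly what the slide does going from 1M-instruction to 100-instruction windows: short low-IPC phases stop being averaged away.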
Slide 8: Our Approach: Composite Cores
• Hypothesis: exploiting fine-grained phases allows more opportunities to run on a Little core
• Problems
  I. How to minimize switching overheads?
  II. When to switch cores?
• Questions
  I. How fine-grained should we go?
  II. How much energy can we save?
Slide 9: Problem I: State Transfer
[Diagram: the Big uEngine pipeline (Fetch, Decode, Rename, O3 Execute) and the Little uEngine pipeline (Fetch, Decode, InO Execute), each with its own iCache, branch predictor, iTLB, dTLB, and dCache. The caches, TLBs, and predictors hold 10s of KB of state; the register file (plus the RAT on the Big side) holds <1 KB.]
State transfer costs can be very high: ~20K cycles (ARM's big.LITTLE).
This limits switching to a coarse granularity: 100M instructions (Kumar '04).
Slide 10: Creating a Composite Core
[Diagram: a Composite Core shares the front end (Fetch, iCache, branch predictor, iTLB), the dCache, the dTLB, and a switching controller between a Big uEngine (Decode, RAT, O3 Execute, register file) and a Little uEngine (Decode, InO Execute, register file, load/store queue); only <1 KB of register state must be transferred on a switch.]

Only one uEngine is active at a time.
Slide 11: Hardware Sharing Overheads
• The Big uEngine needs
  – High fetch width
  – Complex branch prediction
  – Multiple outstanding data cache misses
• The Little uEngine wants
  – Low fetch width
  – Simple branch prediction
  – A single outstanding data cache miss
• Shared units must be built for the Big uEngine, i.e. over-provisioned for the Little uEngine
• Assume clock gating for the inactive uEngine
  – It still has static leakage energy

The Little uEngine pays ~8% energy overhead to use the over-provisioned fetch + caches.
Slide 12: Problem II: When to Switch
• Goal: maximize time on the Little uEngine subject to a maximum performance loss
  – User-configurable
• Traditional OS-based schedulers won't work
  – Decisions are too frequent
  – They need to be made in hardware
• Traditional sampling-based approaches won't work
  – Performance is not stable for long enough
  – Frequent switching just to sample wastes cycles
Slide 13: What uEngine to Pick
• Pick based on a threshold: run on the Little uEngine while the CPI difference between the uEngines stays below ΔCPI_threshold; otherwise run on the Big uEngine
• This value is hard to determine a priori; it depends on the application
  – Use a controller to learn the appropriate value over time
• Let the user configure the target performance level

[Plot: Big Core, Little Core, and their IPC difference vs. instructions (200K–1M), annotated with "Run on Big" and "Run on Little" regions.]
Slide 14: Reactive Online Controller
[Diagram: switching controller.]

• The active uEngine's observed performance (CPI_observed) feeds two linear models: the Big model estimates CPI_big and the Little model estimates CPI_little
• CPI_target is derived from the estimated CPI_big and the user-selected performance level
• A threshold controller computes CPI_error = CPI_target − CPI_actual and updates the threshold:
  ΔCPI_threshold = K_p · CPI_error + K_i · Σ CPI_error
• Decision: run on the Little uEngine (true) when CPI_little ≤ CPI_big + ΔCPI_threshold; otherwise run on the Big uEngine (false)
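The controller loop can be sketched as follows, assuming a standard PI formulation; the gains `K_P`/`K_I`, the 5% slowdown default, and the CPI values are illustrative, not the tuned constants from the paper:

```python
# Sketch of a reactive online switching controller. The threshold grows
# when the core runs ahead of the user's performance target (making the
# Little uEngine acceptable more often) and shrinks when it falls behind.

class SwitchingController:
    K_P, K_I = 0.5, 0.05           # hypothetical PI gains

    def __init__(self, slowdown=0.05):
        self.slowdown = slowdown    # user-selected performance loss
        self.err_sum = 0.0          # integral of the CPI error

    def pick_engine(self, cpi_big, cpi_little, cpi_actual):
        """Return 'little' or 'big' for the next quantum."""
        cpi_target = cpi_big * (1.0 + self.slowdown)
        err = cpi_target - cpi_actual     # > 0: running ahead of target
        self.err_sum += err
        threshold = self.K_P * err + self.K_I * self.err_sum
        # Run on Little when its estimated CPI is within the allowance.
        return "little" if cpi_little <= cpi_big + threshold else "big"
```

At each quantum, only one uEngine is active; the CPI of the inactive one comes from the linear model on the next slide.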
Slide 15: uEngine Modeling
Example loop:
while (flag) { foo(); flag = bar(); }

• Little uEngine IPC (measured): 1.66
• Big uEngine IPC: ??? (inactive)
• Collect metrics of the active uEngine
  – iL1, dL1 cache misses
  – L2 cache misses
  – Branch mispredicts
  – ILP, MLP, CPI
• Use a linear model to estimate the inactive uEngine's performance
• Big uEngine IPC (estimated): 2.15
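The linear estimate might look like this in code; the counter names and weights below are hypothetical placeholders, not the trained coefficients:

```python
# Sketch of the inactive-uEngine performance model: a linear combination
# of the active uEngine's performance counters plus a constant. Weights
# are made-up illustrations; the slides list iL1/dL1 misses, L2 misses,
# branch mispredicts, ILP, MLP, and CPI as the real inputs.

WEIGHTS_BIG = {              # Little is active -> estimate Big's CPI
    "constant": 0.30,
    "l2_misses": 0.002,
    "branch_mispredicts": 0.001,
    "mlp": -0.05,
}

def estimate_inactive_cpi(counters, weights):
    """CPI_inactive ≈ constant + Σ weight_i · counter_i."""
    cpi = weights["constant"]
    for name, w in weights.items():
        if name != "constant":
            cpi += w * counters.get(name, 0.0)
    return cpi
```

A separate weight set would estimate the Little uEngine's CPI while the Big uEngine is active.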
Slide 16: Evaluation
Big uEngine: 3-wide O3 @ 1.0 GHz, 12-stage pipeline, 128 ROB entries, 128-entry register file
Little uEngine: 2-wide in-order @ 1.0 GHz, 8-stage pipeline, 32-entry register file
Memory system: 32 KB L1 i/d cache (1-cycle access), 1 MB L2 cache (15-cycle access), 1 GB main memory (80-cycle access)
Controller: 5% performance loss relative to the all-Big core
Slide 17: Little Engine Utilization
[Plot: Little engine utilization (0–100%) vs. quantum length (100–10M instructions) for astar, bzip2, gcc, gobmk, h264ref, hmmer, mcf, omnetpp, sjeng, and the average; fine-grained quanta at the left, the traditional OS-based quantum at the right.]

• 3-wide O3 (Big) vs. 2-wide in-order (Little)
• 5% performance loss relative to all-Big

Fine-grained quanta give more time on the Little engine at the same performance loss.
Slide 18: Engine Switches
[Plot: switches per million instructions (0–5000) vs. quantum length (100–10M instructions), per benchmark and on average.]

Need LOTS of switching to maximize utilization:
• ~1 switch / 2800 instructions
• ~1 switch / 306 instructions
Slide 19: Performance Loss
[Plot: performance relative to Big (80–105%) vs. quantum length (100–10M instructions), per benchmark and on average; Composite Cores operating point at quantum length = 1000.]

Switching overheads are negligible until the quantum shrinks below ~1000 instructions.
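A toy amortization model shows why per-switch cost caps the useful quantum length. The per-switch cycle counts below are assumptions for illustration (the slides only give big.LITTLE's ~20K cycles and Composite Cores' <1 KB register transfer), and it pessimistically assumes a switch every quantum:

```python
# Back-of-the-envelope: cycles lost to switching as a fraction of each
# quantum, assuming a fixed per-switch cost and one switch per quantum.

def switch_overhead(quantum_instrs, switch_cycles, ipc=1.0):
    """Switching cycles as a fraction of total cycles in a quantum."""
    useful_cycles = quantum_instrs / ipc
    return switch_cycles / (useful_cycles + switch_cycles)

# A hypothetical ~100-cycle Composite Core switch vs. a ~20K-cycle
# big.LITTLE migration, both at a 1000-instruction quantum:
print(switch_overhead(1000, 100))     # ≈ 0.09
print(switch_overhead(1000, 20000))   # ≈ 0.95
```

With a 20K-cycle migration, fine-grained quanta are hopeless; a cheap in-core switch keeps the overhead bounded even at 1000-instruction quanta.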
Slide 20: Fine-Grained vs. Coarse-Grained
• The Little uEngine's average power is 8% higher
  – Due to the shared hardware structures
• Fine-grained switching maps 41% more instructions to the Little uEngine than coarse-grained
• This results in an overall 27% decrease in average power relative to coarse-grained
Slide 21: Decision Techniques

1. Oracle: knows both uEngines' performance for all quanta
2. Perfect Past: knows both uEngines' past performance perfectly
3. Model: knows only the active uEngine's past; models the inactive uEngine using default weights

All techniques target 95% of the all-Big uEngine's performance.
Slide 22: Little Engine Utilization
[Bar chart: dynamic instructions on Little (0–100%) per benchmark (astar, bzip2, gcc, gobmk, h264ref, hmmer, mcf, omnetpp, sjeng, average) for Oracle, Perfect Past, and Model.]

• High utilization for memory-bound applications
• Issue width dominates for computation-bound applications
• The Model maps 25% of the dynamic instructions onto the Little uEngine
Slide 23: Energy Savings
โข Includes the overhead of shared hardware structures
[Bar chart: energy savings relative to Big (0–100%) per benchmark for Oracle, Perfect Past, and Model.]

The Model achieves an 18% reduction in energy consumption.
Slide 24: User-Configured Performance
[Bar chart: Little engine utilization, overall performance, and energy savings (0–100%) at target performance losses of 1%, 5%, 10%, and 20%.]

A 1% performance loss yields 4% energy savings; a 20% performance loss yields 44% energy savings.
Slide 25: More Details in the Paper
• Estimated uEngine area overheads
• uEngine model accuracy
• Switching timing diagram
• Hardware sharing overheads analysis
Slide 26: Conclusions
• Even high performance applications experience fine-grained phases of low throughput
  – Map those to a more efficient core
• Composite Cores allow
  – Fine-grained migration between cores
  – Low-overhead switching
• 18% energy savings by mapping 25% of the instructions to the Little uEngine, with a 5% performance loss

Questions?
Slide 28: Back Up
Slide 29: The DVFS Question
• A lower voltage is useful when:
  – Stalled on an L2 miss (stalled on commit)
• A Little uArch is useful when:
  – Stalled on an L2 miss (stalled at issue)
  – Frequent branch mispredicts (shorter pipeline)
  – Dependent computation

http://www.arm.com/files/downloads/big_LITTLE_Final_Final.pdf
Slide 30: Sharing Overheads
[Bar chart: average power relative to the Big core (0–110%) per benchmark for the Big uEngine, the Little core, and the Little uEngine.]
Slide 31: Performance
5% performance loss target.

[Bar chart: performance relative to Big (90–103%) per benchmark for Oracle, Perfect Past, and Model.]
Slide 32: Model Accuracy
[Histograms: percent of quanta (0–35%) vs. percent deviation of modeled from actual performance (−100% to +100%), for the Little → Big and Big → Little models, with the model's average performance marked.]
Slide 33: Regression Coefficients
[Stacked bar chart: relative coefficient magnitudes (0–100%) for the Little → Big and Big → Little models; components are L2 miss, branch mispredicts, ILP, L2 hit, MLP, active uEngine cycles, and a constant.]
Slide 34: Different Than Kumar et al.

Kumar et al.:
• Coarse-grained switching
• OS managed
• Minimal shared state (L2s)
• Requires sampling
• 6-wide O3 vs. 8-wide O3 (has an in-order core, but never uses it!)

Composite Cores:
• Fine-grained switching
• Hardware managed
• Maximized shared state (L2s, L1s, branch predictor, TLBs)
• On-the-fly prediction
• 3-wide O3 vs. 2-wide in-order

Coarse-grained vs. fine-grained.
Slide 35: Register File Transfer

[Diagram: the RAT maps register numbers to physical registers; values are copied between the two register files, and commit can update entries mid-transfer.]

3-stage transfer pipeline:
1. Map to the physical register in the RAT
2. Read the physical register
3. Write to the new register file

If commit updates a register during the transfer, repeat for that register.
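The three-stage transfer, with the commit-repeat rule, can be sketched as follows; the dict-based data structures are simplified stand-ins, not the hardware interface:

```python
# Sketch of the register file transfer between uEngines.

def transfer_registers(rat, phys_regs, dest_regfile, committed):
    """Copy architectural register state to the other uEngine.

    rat:          architectural register -> physical register index
    phys_regs:    physical register index -> value
    dest_regfile: destination register file, filled in here
    committed:    registers remapped by commit during the transfer
    """
    pending = set(rat)
    while pending:
        arch = pending.pop()
        phys = rat[arch]                  # 1. map via the RAT
        value = phys_regs[phys]           # 2. read the physical register
        dest_regfile[arch] = value        # 3. write the new register file
        if arch in committed:             # commit updated it: repeat
            rat[arch] = committed.pop(arch)
            pending.add(arch)
    return dest_regfile
```

Because only this <1 KB of architectural state moves, the switch avoids big.LITTLE's ~20K-cycle migration cost.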
Slide 36: uEngine Model
• Linear model: P = Σ_i a_i · x_i
  – P: average uEngine performance
  – x_i: performance counter value
  – a_i: weight of the performance counter
• Different weights for the Big and Little uEngine models
• Fixed vs. per-application weights?
  – Default weights, fixed at design time
  – Per-application weights
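Deriving the weights could be done by ordinary least squares over profiled (counter, CPI) pairs; this one-counter sketch with toy training data is an assumption about the fitting procedure, which the slides do not specify:

```python
# One-counter least-squares fit for a model CPI ≈ w * counter + b.
# Default weights would be fit once at design time over many
# applications; per-application weights refit on one application.

def fit_weight(xs, ys):
    """Ordinary least squares for a single performance counter."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    w = cov / var
    return w, mean_y - w * mean_x

# Toy profile where CPI = 0.002 * l2_misses + 0.3:
w, b = fit_weight([100.0, 200.0, 300.0], [0.5, 0.7, 0.9])
print(w, b)   # w ≈ 0.002, b ≈ 0.3
```

The real models use one weight per counter (L2 misses, branch mispredicts, ILP, MLP, etc.), which generalizes this to a multivariate regression.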