Using Interaction Cost (icost) for Microarchitectural Bottleneck
Analysis
Brian Fields (1), Rastislav Bodik (1), Mark Hill (2), Chris Newburn (3)
(1) UC-Berkeley, (2) UW-Madison, (3) Intel
Outline
Interaction Cost
  Bottleneck analysis complicated by parallelism
  Parallelism causes interactions
  • Qualitative: parallel and serial interactions
  • Quantitative: interaction cost (icost)
  Icost case study: designing a deep pipeline
Hardware profiler
  Icost “shotgun” profiler
  • Replace current performance counters
Bottleneck analysis is hard
Why? Microarchitectural parallelism complicates performance understanding
• A branch mispredict and full-store-buffer stall occur in the same cycle that three loads are waiting on the memory system and two floating-point multiplies are executing
• Two parallel cache misses
• A multiply and window stall
What we want from bottleneck analysis
Performance cost (or reward): the speedup when the bottleneck is removed
Q: What if two bottlenecks interact?
Our solution: measure interactions
Two parallel cache misses (each 100 cycles)
miss #1 (100)
miss #2 (100)
Cost(miss #1) = 0
Cost(miss #2) = 0
Cost({miss #1, miss #2}) = 100
Aggregate cost > sum of individual costs (100 > 0 + 0): parallel interaction
icost = aggregate cost - sum of individual costs
= 100 - 0 - 0 = 100
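The arithmetic above can be reproduced with a small sketch. It assumes a toy timing model in which the two misses overlap completely, so execution time is simply the longer of the two latencies; the function names and the event names `miss1` and `miss2` are illustrative, not taken from the slides.

```python
# Toy model (assumption): two fully overlapped misses, so the slower one
# determines execution time.
LATENCY = {"miss1": 100, "miss2": 100}


def exec_time(idealized=frozenset()):
    """Execution time when the events in `idealized` take 0 cycles."""
    return max(0 if name in idealized else lat for name, lat in LATENCY.items())


def cost(events):
    """Cycles saved by idealizing (removing) the given set of events."""
    return exec_time() - exec_time(frozenset(events))


def icost(a, b):
    """Interaction cost of a pair: aggregate cost minus individual costs."""
    return cost({a, b}) - cost({a}) - cost({b})


print(cost({"miss1"}), cost({"miss2"}), cost({"miss1", "miss2"}))
print(icost("miss1", "miss2"))
```

Removing either miss alone saves nothing because the other still takes 100 cycles, so the whole saving shows up only in the aggregate cost.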
Interaction cost (icost)
icost = aggregate cost - sum of individual costs
1. Positive icost: parallel interaction (miss #1 overlaps miss #2)
2. Zero icost: independent (miss #1 . . . miss #2)
3. Negative icost: ?
Negative icost
Two serial cache misses (data dependent)
miss #1 (100), then miss #2 (100), in parallel with ALU latency (110 cycles)
Cost(miss #1) = 90
Cost(miss #2) = 90
Cost({miss #1, miss #2}) = 90
icost = aggregate cost - sum of individual costs
= 90 - 90 - 90 = -90
Negative icost: serial interaction
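The serial example can be checked with a longest-path computation over a tiny dependence graph. This is a sketch under stated assumptions: the node names, the edge list, and the helper names are hypothetical, and execution time is taken to be the longest path from `start` to `end`.

```python
# Hypothetical three-edge graph: miss #1 feeds miss #2, while a chain of
# ALU operations (110 cycles in total) runs in parallel with both.
# Edges are listed so that every edge into a node precedes every edge out of it.
EDGES = [
    ("start", "mid", "miss1", 100),
    ("mid", "end", "miss2", 100),
    ("start", "end", "alu", 110),
]


def critical_path(idealized=frozenset()):
    """Longest start-to-end path, with idealized events taking 0 cycles."""
    dist = {"start": 0}
    for src, dst, name, latency in EDGES:
        if name in idealized:
            latency = 0
        dist[dst] = max(dist.get(dst, 0), dist[src] + latency)
    return dist["end"]


def cost(events):
    """Cycles saved by idealizing the given set of events."""
    return critical_path() - critical_path(frozenset(events))


def icost(a, b):
    return cost({a, b}) - cost({a}) - cost({b})


def classify(value):
    """Map the sign of an icost to the kind of interaction."""
    if value > 0:
        return "parallel"
    if value < 0:
        return "serial"
    return "independent"


value = icost("miss1", "miss2")
print(cost({"miss1"}), cost({"miss2"}), cost({"miss1", "miss2"}), value, classify(value))
```

Once either miss is removed, the 110-cycle ALU path becomes critical, which is why removing the second miss as well saves nothing further.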
Interaction cost (icost)
icost = aggregate cost - sum of individual costs
1. Positive icost: parallel interaction (miss #1 overlaps miss #2)
2. Zero icost: independent (miss #1 . . . miss #2)
3. Negative icost: serial interaction (miss #1 then miss #2, in parallel with ALU latency)
Branch mispredict
Fetch BW
Load-Replay Trap
LSQ stall
Why care about serial interactions?
miss #1 (100), then miss #2 (100), in parallel with ALU latency (110 cycles)
Reason #1: We are over-optimizing!
• Prefetching miss #2 doesn’t help if miss #1 is already prefetched (but the overhead still costs us)
Reason #2: We have a choice of what to optimize
• Prefetching miss #2 has the same effect as prefetching miss #1
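Both reasons follow from the costs of the serial-miss example (90 cycles for either miss alone and 90 for both together). A minimal sketch, with a hypothetical helper `marginal_benefit` that asks how many extra cycles are saved by removing one more event:

```python
# Costs from the serial-miss example; the empty set costs nothing by definition.
COST = {
    frozenset(): 0,
    frozenset({"miss1"}): 90,
    frozenset({"miss2"}): 90,
    frozenset({"miss1", "miss2"}): 90,
}


def marginal_benefit(extra, already=frozenset()):
    """Extra cycles saved by removing `extra` when `already` is removed."""
    already = frozenset(already)
    return COST[already | {extra}] - COST[already]


# Reason #1: once miss #1 is prefetched, prefetching miss #2 adds nothing.
print(marginal_benefit("miss2", {"miss1"}))
# Reason #2: starting from scratch, either miss gives the same saving.
print(marginal_benefit("miss1"), marginal_benefit("miss2"))
```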
Icost Case Study: Deep pipelines
Deep pipelines cause long latency loops:
• level-one (DL1) cache access, issue-wakeup, branch misprediction, …
But can often mitigate them indirectly
Assume 4-cycle DL1 access; how to mitigate?
• Increase cache ports?
• Increase window size?
• Increase fetch BW?
• Reduce cache misses?
Really, looking for serial interactions!
Icost Case Study: Deep pipelines
[Figure: dependence graph for instructions i1 to i6, with F, E, and C nodes per instruction and latency-labelled edges; the 4-cycle DL1 access edges and a window edge are marked]
Icost Breakdown (6 wide, 64-entry window)
              gcc      gzip     vortex
DL1           18.3 %   30.5 %   25.8 %
DL1+window    -4.2     -15.3    -24.5
DL1+bw        10.0     6.0      15.5
DL1+bmisp     -7.0     -3.4     -0.3
DL1+dmiss     -1.4     -0.4     -1.4
DL1+alu       -1.6     -8.2     -4.7
DL1+imiss     0.1      0.0      0.4
...           ...      ...      ...
Total         100.0    100.0    100.0
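One way to read the breakdown is to look, per benchmark, for the most negative DL1 interaction, since a negative icost marks a serial interaction that could be used to hide DL1 latency indirectly. The sketch below does this for the numbers in the table; the dictionary layout and the function name `strongest_serial` are illustrative assumptions, and the rows elided as "..." on the slide are not included.

```python
# Pairwise DL1 interaction costs (percent of execution time) from the table.
BREAKDOWN = {
    "DL1+window": {"gcc": -4.2, "gzip": -15.3, "vortex": -24.5},
    "DL1+bw": {"gcc": 10.0, "gzip": 6.0, "vortex": 15.5},
    "DL1+bmisp": {"gcc": -7.0, "gzip": -3.4, "vortex": -0.3},
    "DL1+dmiss": {"gcc": -1.4, "gzip": -0.4, "vortex": -1.4},
    "DL1+alu": {"gcc": -1.6, "gzip": -8.2, "vortex": -4.7},
    "DL1+imiss": {"gcc": 0.1, "gzip": 0.0, "vortex": 0.4},
}


def strongest_serial(benchmark):
    """Return (row, icost) for the most negative interaction, or None."""
    row, value = min(
        ((name, values[benchmark]) for name, values in BREAKDOWN.items()),
        key=lambda pair: pair[1],
    )
    return (row, value) if value < 0 else None


for bench in ("gcc", "gzip", "vortex"):
    print(bench, strongest_serial(bench))
```

For gzip and vortex the largest serial interaction among the listed rows is with the window, while for gcc it is with branch mispredictions.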
Vortex Breakdowns, enlarging the window
              64       128      256
DL1           25.8     8.9      3.9
DL1+window    -24.5    -7.7     -2.6
DL1+bw        15.5     16.7     13.2
DL1+bmisp     -0.3     -0.6     -0.8
DL1+dmiss     -1.4     -2.1     -2.8
DL1+alu       -4.7     -2.5     -0.4
DL1+imiss     0.4      0.5      0.3
...           ...      ...      ...
Total         100.0    80.8     75.0
Outline
Interaction Cost
  Bottleneck analysis complicated by parallelism
  Parallelism causes interactions
  • Qualitative: parallel and serial interactions
  • Quantitative: interaction cost (icost)
  Icost case study: designing a deep pipeline
  • Exploiting serial interactions
Hardware profiler
  Icost “shotgun” profiler
  • Overcome the limitations of performance counters
Profiling goal
Goal:
• Construct graph of many dynamic instructions
Constraint:
• Can only sample sparsely
Genome sequencing
[Figure: a DNA strand]
“Shotgun” genome sequencing
[Figure: short samples taken from the DNA strand]
Find overlaps among samples
Mapping “shotgun” to our situation
[Figure: samples taken from a stream of many dynamic instructions; legend: Icache miss, Dcache miss, Branch misp., No event]
Profiler hardware requirements
[Figure: overlapping samples of the instruction stream, annotated “Match!”]
Conclusion
Bottleneck analysis is complicated by parallelism
Parallelism is interpreted with interaction cost (icost)
• Three possibilities: independent, parallel, or serial
Applies to all instructions, resources, events
Enabled by the “shotgun” profiler: interaction cost overcomes limitations of counters
Icost Case Study: Deep pipelines
[Figure: dependence graph for instructions i1 to i6, with F, E, and C nodes per instruction and latency-labelled edges; marked edges: DL1 access, window edge, decode and rename, multiply + pipe latency, Icache miss]
Profiler software requirements
Software puts the graph together
• Skeleton sample
• Detailed samples (with matching PC)
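The assembly step can be sketched as a lookup by program counter. This is only an illustration under simplifying assumptions: the skeleton is reduced to a list of PCs, each detailed sample is a dictionary with a hypothetical `pc` field plus measured latencies, and the first detailed sample with a matching PC is used. The slides do not specify the actual data layout or selection rule.

```python
from collections import defaultdict


def assemble(skeleton_pcs, detailed_samples):
    """Fill each skeleton slot with a detailed sample that has the same PC.

    Slots with no matching detailed sample are left as None.
    """
    by_pc = defaultdict(list)
    for sample in detailed_samples:
        by_pc[sample["pc"]].append(sample)

    graph = []
    for pc in skeleton_pcs:
        candidates = by_pc.get(pc)  # .get() does not create missing keys
        graph.append(candidates[0] if candidates else None)
    return graph


# Hypothetical data: three skeleton instructions, two detailed samples.
skeleton = [0x10, 0x14, 0x18]
details = [
    {"pc": 0x14, "exec_latency": 4},
    {"pc": 0x10, "exec_latency": 1},
]
for pc, node in zip(skeleton, assemble(skeleton, details)):
    print(hex(pc), node)
```

Because detailed samples are collected sparsely, some skeleton slots may stay unfilled, which is why the sketch keeps `None` placeholders rather than dropping those instructions.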
Compare Icost and Sensitivity Study
Corollary to DL1 and ROB serial interaction:
As load latency increases, the benefit from enlarging the ROB increases.
[Figure: dependence graph for instructions i1 to i6, with F, E, and C nodes per instruction and latency-labelled edges; the DL1 access edges are marked]
Compare Icost and Sensitivity Study
[Figure: speedup (y axis, 0 to 25) versus ROB size (x axis: 64, 128, 192, 256), with one curve per DL1 latency]
Compare Icost and Sensitivity Study
Sensitivity Study Advantages
• More information
  • e.g., concave or convex curves
Interaction Cost Advantages
• Easy (automatic) interpretation
  • Sign and magnitude have well defined meanings
• Concise communication
  • DL1 and ROB interact serially