
Using Interaction Cost (icost) for Microarchitectural Bottleneck Analysis

Brian Fields (1), Rastislav Bodik (1), Mark Hill (2), Chris Newburn (3)

(1) UC-Berkeley, (2) UW-Madison, (3) Intel

Outline

Interaction Cost
• Bottleneck analysis complicated by parallelism
• Parallelism causes interactions
  • Qualitative: parallel and serial interactions
  • Quantitative: interaction cost (icost)
• Icost case study: designing a deep pipeline

Hardware profiler
• Icost “shotgun” profiler
  • Replace current performance counters

Why? Microarchitectural parallelism complicates performance understanding

Bottleneck analysis is hard

• A branch mispredict and full-store-buffer stall occur in the same cycle that three loads are waiting on the memory system and two floating-point multiplies are executing

• Two parallel cache misses

• A multiply and window stall

What we want from bottleneck analysis

Performance cost (or reward) of a bottleneck: the speedup when the bottleneck is removed

Q: What if two bottlenecks interact?

Our solution: measure interactions

Two parallel cache misses (each 100 cycles)

[Figure: miss #1 (100) and miss #2 (100) overlap in time]

Cost(miss #1) = 0
Cost(miss #2) = 0
Cost({miss #1, miss #2}) = 100

Aggregate cost > sum of individual costs (100 > 0 + 0): parallel interaction

icost = aggregate cost - sum of individual costs
      = 100 - 0 - 0 = 100
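The arithmetic on this slide can be reproduced with a few lines of Python. This is only a sketch under a toy timing model: execution time is assumed to be the longest of the overlapping miss latencies, and the names `exec_time`, `cost`, and `icost_pair` are illustrative, not taken from the talk.

```python
# Toy model (an assumption for illustration): two 100-cycle cache misses
# that fully overlap, so execution time is the longest remaining latency.
LATENCIES = {"miss1": 100, "miss2": 100}


def exec_time(removed=frozenset()):
    """Cycles to execute when the events in `removed` are idealized away."""
    remaining = [lat for name, lat in LATENCIES.items() if name not in removed]
    return max(remaining, default=0)


def cost(events):
    """Cycles saved when every event in `events` is removed together."""
    return exec_time() - exec_time(frozenset(events))


def icost_pair(a, b):
    """icost = aggregate cost - sum of individual costs."""
    return cost({a, b}) - cost({a}) - cost({b})


print(cost({"miss1"}), cost({"miss2"}), cost({"miss1", "miss2"}))  # 0 0 100
print(icost_pair("miss1", "miss2"))  # 100
```

Removing either miss alone saves nothing because the other still takes 100 cycles; only removing both exposes the speedup, which is why the icost comes out positive.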

Interaction cost (icost)

icost = aggregate cost - sum of individual costs

1. Positive icost: parallel interaction
   [Figure: miss #1 and miss #2 overlap]
2. Zero icost: independent
   [Figure: miss #1 . . . miss #2]
3. Negative icost: ?

Negative icost

Two serial cache misses (data dependent)

[Figure: miss #1 (100) feeds miss #2 (100); ALU latency (110 cycles) runs alongside both]

Cost(miss #1) = 90
Cost(miss #2) = 90
Cost({miss #1, miss #2}) = 90

icost = aggregate cost - sum of individual costs
      = 90 - 90 - 90 = -90

Negative icost: serial interaction
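The serial case can be checked the same way by treating the slide's picture as a small dependence graph and taking its longest path. The sketch below is an assumption-laden illustration: the edge list, node names, and the `critical_path` helper are hypothetical, and "removing" an event is modeled as setting its latency to zero.

```python
from functools import lru_cache

# Hypothetical edge list: (source, destination, latency, event name or None).
# miss1 feeds miss2 (data dependent); the ALU chain runs alongside both.
EDGES = [
    ("start", "mid", 100, "miss1"),
    ("mid", "end", 100, "miss2"),
    ("start", "end", 110, None),  # ALU latency, never idealized here
]


def critical_path(removed=frozenset()):
    """Longest start-to-end path, with removed events given zero latency."""

    def latency(edge):
        return 0 if edge[3] in removed else edge[2]

    @lru_cache(maxsize=None)
    def longest(node):
        outgoing = [e for e in EDGES if e[0] == node]
        return max((latency(e) + longest(e[1]) for e in outgoing), default=0)

    return longest("start")


def cost(events):
    """Cycles saved when every event in `events` is removed together."""
    return critical_path() - critical_path(frozenset(events))


c1, c2, c12 = cost({"miss1"}), cost({"miss2"}), cost({"miss1", "miss2"})
print(c1, c2, c12)    # 90 90 90
print(c12 - c1 - c2)  # -90
```

Once one miss is gone, the 110-cycle ALU path becomes critical, so removing the second miss buys nothing more; that is what the negative icost records.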


1. Positive icost: parallel interaction
   [Figure: miss #1 and miss #2 overlap]
2. Zero icost: independent
   [Figure: miss #1 . . . miss #2]
3. Negative icost: serial interaction
   [Figure: miss #1 followed by miss #2, alongside ALU latency]

Other event labels on this slide: Branch mispredict, Fetch BW, Load-Replay Trap, LSQ stall

Why care about serial interactions?

ALU latency (110 cycles)

miss #1 (100)

miss #2 (100)

Reason #1: We are over-optimizing!
• Prefetching miss #2 doesn’t help if miss #1 is already prefetched (but the overhead still costs us)

Reason #2: We have a choice of what to optimize
• Prefetching miss #2 has the same effect as prefetching miss #1
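Both reasons can be seen numerically with the slide's own latencies. The sketch below assumes the same toy model as the figure (two dependent 100-cycle misses alongside a 110-cycle ALU chain) and treats a perfect prefetch as removing the miss entirely; `marginal_gain` is a hypothetical helper name.

```python
def exec_time(removed=frozenset()):
    """Toy model: serial misses (100 + 100) in parallel with a 110-cycle ALU chain."""
    miss1 = 0 if "miss1" in removed else 100
    miss2 = 0 if "miss2" in removed else 100
    return max(miss1 + miss2, 110)


def marginal_gain(event, already_removed=frozenset()):
    """Extra cycles saved by removing `event` on top of `already_removed`."""
    base = frozenset(already_removed)
    return exec_time(base) - exec_time(base | {event})


# Reason #2: either miss is an equally good target.
print(marginal_gain("miss1"), marginal_gain("miss2"))  # 90 90

# Reason #1: once miss #1 is prefetched, prefetching miss #2 adds nothing.
print(marginal_gain("miss2", {"miss1"}))  # 0
```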

Icost Case Study: Deep pipelines

Deep pipelines cause long latency loops:
• level-one (DL1) cache access, issue-wakeup, branch misprediction, …

But we can often mitigate them indirectly

Assume 4-cycle DL1 access; how to mitigate?
• Increase cache ports?
• Increase window size?
• Increase fetch BW?
• Reduce cache misses?

Really, we are looking for serial interactions!


[Figure: dependence graph for instructions i1 to i6, with F, E, and C nodes per instruction and latency-weighted edges; labeled edges: DL1 access, window edge]

Icost Breakdown (6 wide, 64-entry window)

              gcc       gzip      vortex
DL1           18.3 %    30.5 %    25.8 %
DL1+window    -4.2     -15.3     -24.5
DL1+bw        10.0       6.0      15.5
DL1+bmisp     -7.0      -3.4      -0.3
DL1+dmiss     -1.4      -0.4      -1.4
DL1+alu       -1.6      -8.2      -4.7
DL1+imiss      0.1       0.0       0.4
...            ...       ...       ...
Total        100.0     100.0     100.0
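A breakdown like this one has a row per category and a Total row. One way such rows can add up is to extend the two-event definition to larger sets: take the icost of a set to be its aggregate cost minus the icosts of all its proper subsets. That extension is an assumption here (the slides only define the pairwise case), and the three-event timing model and its latencies below are made up for illustration; they are not the measured numbers in the table.

```python
from itertools import combinations

# Made-up latencies (cycles) for a toy model; not data from the slides.
LATENCY = {"dl1": 40, "window": 40, "alu": 50}


def exec_time(removed=frozenset()):
    """Toy model: dl1 and window are in series, alu runs in parallel."""
    lat = {k: (0 if k in removed else v) for k, v in LATENCY.items()}
    return max(lat["dl1"] + lat["window"], lat["alu"])


def cost(events):
    """Cycles saved when every event in `events` is removed together."""
    return exec_time() - exec_time(frozenset(events))


def nonempty_subsets(events, proper=False):
    """Yield non-empty subsets of `events`, optionally excluding the full set."""
    items = sorted(events)
    largest = len(items) - 1 if proper else len(items)
    for size in range(1, largest + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def icost(events):
    """Assumed generalization: cost(S) minus icost of every proper subset of S."""
    events = frozenset(events)
    return cost(events) - sum(icost(s) for s in nonempty_subsets(events, proper=True))


breakdown = {s: icost(s) for s in nonempty_subsets(LATENCY)}
for subset, value in breakdown.items():
    print("+".join(sorted(subset)), value)
print("Total", sum(breakdown.values()))  # Total 80
```

Under this definition the entries telescope, so their sum equals the cost of removing everything, and the serial pair (`dl1+window`) shows up with a negative entry, as in the table above.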

Vortex Breakdowns, enlarging the window

Window size    64       128      256
DL1            25.8      8.9      3.9
DL1+window    -24.5     -7.7     -2.6
DL1+bw         15.5     16.7     13.2
DL1+bmisp      -0.3     -0.6     -0.8
DL1+dmiss      -1.4     -2.1     -2.8
DL1+alu        -4.7     -2.5     -0.4
DL1+imiss       0.4      0.5      0.3
...             ...      ...      ...
Total         100.0     80.8     75.0

Outline

Interaction Cost
• Bottleneck analysis complicated by parallelism
• Parallelism causes interactions
  • Qualitative: parallel and serial interactions
  • Quantitative: interaction cost (icost)
• Icost case study: designing a deep pipeline
  • Exploiting serial interactions

Hardware profiler
• Icost “shotgun” profiler
  • Overcome the limitations of performance counters

Profiling goal

Goal:
• Construct graph (of many dynamic instructions)

Constraint:
• Can only sample sparsely

Genome sequencing

[Figure: a DNA strand]

“Shotgun” genome sequencing

[Figure: many short samples taken from the DNA]
• Find overlaps among samples

Mapping “shotgun” to our situation

[Figure: many dynamic instructions, each marked as one of: Icache miss, Dcache miss, Branch misp., No event]

Profiler hardware requirements

[Figure: samples of the dynamic instruction stream are compared; label: Match!]

Conclusion

Bottleneck analysis is complicated by parallelism

Parallelism is interpreted with interaction cost (icost)
• Three possibilities: independent, parallel, or serial
• Applies to all instructions, resources, events

Enabled by the “shotgun” profiler:
• Interaction cost overcomes limitations of counters


[Figure: dependence graph for instructions i1 to i6, with F, E, and C nodes per instruction and latency-weighted edges; labeled edges: DL1 access, window edge, decode and rename, multiply + pipe latency, Icache miss]

Profiler software requirements

Software puts the graph together
• Skeleton sample
• Detailed samples (with matching PC)
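The stitching step can be pictured as a lookup by PC. The sketch below is purely illustrative: the slides do not specify sample formats, so the dictionary layout, the field names (`pc`, `events`), and the `build_fragment` function are all hypothetical, and picking the first detailed sample for each PC is a simplification.

```python
from collections import defaultdict


def build_fragment(skeleton_pcs, detailed_samples):
    """Fill each skeleton slot with a detailed sample whose PC matches.

    skeleton_pcs: PCs of consecutive dynamic instructions (hypothetical format).
    detailed_samples: dicts with a "pc" key plus per-instruction detail.
    Slots with no matching detailed sample keep only their PC.
    """
    by_pc = defaultdict(list)
    for sample in detailed_samples:
        by_pc[sample["pc"]].append(sample)

    fragment = []
    for pc in skeleton_pcs:
        candidates = by_pc.get(pc)  # .get avoids creating empty entries
        if candidates:
            fragment.append(candidates[0])  # simplification: take the first match
        else:
            fragment.append({"pc": pc, "events": None})
    return fragment


skeleton = [0x400, 0x404, 0x408]
detailed = [
    {"pc": 0x404, "events": ["dcache_miss"]},
    {"pc": 0x400, "events": []},
]
for node in build_fragment(skeleton, detailed):
    print(hex(node["pc"]), node["events"])
```

The result is a graph fragment in skeleton order, with detail attached wherever a sample with the same PC was available.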

Compare Icost and Sensitivity Study

Corollary to DL1 and ROB serial interaction:
• As load latency increases, the benefit from enlarging the ROB increases.

[Figure: dependence graph for instructions i1 to i6, with F, E, and C nodes per instruction and latency-weighted edges; labeled edge: DL1 access]

Compare Icost and Sensitivity Study

[Figure: speedup (y-axis, 0 to 25) versus ROB size (x-axis: 64, 128, 192, 256), one curve per DL1 latency (1, 2, 3, 4, 5, 10)]

Compare Icost and Sensitivity Study

Sensitivity Study Advantages
• More information
  • e.g., concave or convex curves

Interaction Cost Advantages
• Easy (automatic) interpretation
  • Sign and magnitude have well-defined meanings
• Concise communication
  • DL1 and ROB interact serially
