CS 7960-4 Lecture 4

Clock Rate vs. IPC: The End of the Road for Conventional Microarchitectures

V. Agarwal, M.S. Hrishikesh, S.W. Keckler, D. Burger
UT-Austin
ISCA ’00

Previous Papers

• Limits of ILP – it is probably worth doing o-o-o superscalar

• Complexity-Effective – wire delays make the implementations harder and increase latencies

• Today’s paper – these latencies severely impact IPCs and slow the growth in processor performance

1995-2000

• Figure 1. Clock speed has improved by 50% every year

  - Reduction in logic delays
  - Deeper pipelines
  - This will soon end

• IPC has gone up dramatically (the increased complexity was worth it)
  - Will this end too?

Wire Scaling

• Multiple wire layers – the SIA roadmap predicts dimensions (somewhat aggressive)

• As transistor widths shrink, wires become thinner, and their resistance per unit length goes up (quadratically – Table 1)

• Parallel-plate capacitance reduces, but coupling capacitance increases (slight overall increase)

• The equations are different, but the end result is similar to Palacharla’s (without repeaters)

Wire Scaling

• With repeaters, delay of a fixed-length wire does not go up quadratically as we shrink gate-width

• In going from 250nm to 35nm:

                              250nm          35nm
  5mm wire delay              170ps          390ps
  delay to cross X gates      170ps          55ps
  SIA clock speed             0.75GHz        13.5GHz
  delay to cross X gates      0.13 cycles    0.75 cycles

• We could increase wire width, but that compromises bandwidth
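The cycle counts on this slide follow from dividing a delay by the clock period. As a minimal sketch, the same conversion can be applied to the 5mm wire figures quoted above; the function and variable names are illustrative and not taken from the paper.

```python
def delay_in_cycles(delay_ps: float, clock_ghz: float) -> float:
    """Convert a delay in picoseconds to clock cycles at the given frequency."""
    cycle_time_ps = 1000.0 / clock_ghz  # a 1 GHz clock has a 1000 ps period
    return delay_ps / cycle_time_ps


# Values copied from the slide: (5mm wire delay in ps, SIA clock speed in GHz)
generations = {
    "250nm": (170.0, 0.75),
    "35nm": (390.0, 13.5),
}

wire_cycles = {
    node: delay_in_cycles(delay_ps, clock_ghz)
    for node, (delay_ps, clock_ghz) in generations.items()
}

for node, cycles in wire_cycles.items():
    print(f"{node}: 5mm wire takes {cycles:.2f} cycles")
```

With these inputs, the wire that costs about an eighth of a cycle at 250nm costs a little over five cycles at 35nm, even though its absolute delay only slightly more than doubles.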

Clock Scaling

• Logic delay (the FO4 delay) scales linearly with gate length

• Likewise, work per pipeline stage has also been shrinking (Fig. 2)

• The SIA predicts that today’s 16 FO4/stage delay will shrink to 5.6 FO4/stage

• A 64-bit add takes 5.5 FO4 – hence, they examine SIA (super-aggressive), 8-FO4 (aggressive), and 16-FO4 (conservative) scaling strategies

Clock Scaling

• While the 15-20% improvement in technology scaling will continue, the 15-20% improvement in pipeline depth will cease
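To make the three clock scaling strategies concrete, the sketch below turns an FO4-per-stage budget into a clock frequency. It assumes the common rule of thumb that one FO4 delay is about 360 ps per micron of gate length; that constant is an assumption introduced here and does not appear on the slides, and the function name is illustrative.

```python
# Assumed rule of thumb (not from the slides): FO4 delay is about
# 360 ps multiplied by the gate length in microns.
FO4_PS_PER_MICRON = 360.0


def clock_ghz(gate_length_nm: float, fo4_per_stage: float) -> float:
    """Estimate clock frequency from gate length and logic depth per stage."""
    fo4_ps = FO4_PS_PER_MICRON * gate_length_nm / 1000.0  # one FO4 delay in ps
    cycle_ps = fo4_ps * fo4_per_stage                     # clock period in ps
    return 1000.0 / cycle_ps


# FO4 per stage for the three strategies named on the slide
strategies = {"SIA": 5.6, "f8": 8.0, "f16": 16.0}

clocks_35nm = {name: clock_ghz(35.0, fo4) for name, fo4 in strategies.items()}

for name, ghz in clocks_35nm.items():
    print(f"{name}: {ghz:.1f} GHz at 35nm")
```

Under this assumed constant, the 5.6-FO4 point lands near 14 GHz, in the same range as (but not identical to) the 13.5GHz SIA figure quoted earlier, while 16 FO4 per stage gives roughly 5 GHz.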

On-Chip Wire Delays

• The number of bits reachable in a cycle is shrinking (by more than a factor of two across three generations)
  - Structures that fit in a cycle today will have to be shrunk (smaller regfiles, issue queues)

• Chip area is steadily increasing
  - Less than 1% of the chip reachable in a cycle, 30 cycles to go across the chip!

Processors are becoming communication-bound

Processor Structure Delays

• To model the microarchitecture, they estimate the delays of all wire-limited structures

  Structure (delay in cycles)    fSIA    f8    f16
  64K-2-port L1                  7       5     3
  64-entry 10-port regfile       3       2     1
  20-entry 8-port issueq         3       2     1
  64-entry 8-port ROB            3       2     1

• Weakness: bypass delays are not considered

Microarchitecture Scaling

• Capacity Scaling: constant access latencies in cycles (simpler designs), scale capacities down to make it fit

• Pipeline Scaling: constant capacities, latencies go up, hence, deeper pipelines

• Any other approaches?

• Replicated Capacity Scaling: fast core with few resources, but lots of them – high IPC if you can localize communication

IPC Comparisons

[Figure: block diagrams comparing Pipeline Scaling, Capacity Scaling, and Replicated Capacity Scaling. Recoverable labels: 20-IQ / 40 Regs blocks with four F units each; 15-IQ / 30 Regs blocks with three F units each; "2-cycle wakeup, 2-cycle regread, 2-cycle bypass".]

Results

• Tables on Pg. 10

• Every instruction experiences longer latencies

• IPCs are much lower for aggressive clocks

• Overall performance is still comparable for all approaches

Results

• In 17 years, we are seeing only a 7-fold speedup (historically, it should have been 1720-fold) – annual increase of 12.5%

• Slow growth because pipeline depth and IPC increase will stagnate
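The speedup figures above are compound-growth arithmetic. A minimal sketch, using only the numbers on this slide (17 years, 12.5% per year, 1720-fold), checks the projected speedup and backs out the annual rate that the historical figure implies; the helper names are illustrative.

```python
def total_speedup(annual_rate: float, years: int) -> float:
    """Cumulative speedup after compounding annual_rate for the given years."""
    return (1.0 + annual_rate) ** years


def implied_annual_rate(speedup: float, years: int) -> float:
    """Annual growth rate that compounds to the given speedup over the years."""
    return speedup ** (1.0 / years) - 1.0


YEARS = 17

projected = total_speedup(0.125, YEARS)                # roughly 7.4-fold
historical_rate = implied_annual_rate(1720.0, YEARS)   # roughly 0.55

print(f"12.5% per year for {YEARS} years: {projected:.1f}x")
print(f"1720x over {YEARS} years: {historical_rate:.1%} per year")
```

The 1720-fold figure corresponds to growth of about 55% per year, so the projection amounts to the annual improvement dropping from about 55% to 12.5%.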

Questionable Assumptions

• Additional transistors are not being used to improve IPC

• All instructions pay wire-delay penalties

Conclusions

• Large monolithic cores will perform poorly – microarchitectures will have to be partitioned

• On-chip caches will be the biggest bottlenecks – 3-cycle 0.5KB L1s, 30-50-cycle 2MB L2s

• Future proposals should be wire-delay-sensitive

Next Class’ Paper

• “Dynamic Code Partitioning for Clustered Architectures”, UPC-Barcelona, 2001

• Instruction steering heuristics to balance load and minimize communication
