CS 7960-4 Lecture 4
Clock Rate vs. IPC: The End of the Road for Conventional Microarchitectures
V. Agarwal, M.S. Hrishikesh, S.W. Keckler, D. Burger
UT-Austin
ISCA ’00
Previous Papers
• Limits of ILP – it is probably worth doing o-o-o superscalar
• Complexity-Effective – wire delays make the implementations harder and increase latencies
• Today’s paper – these latencies severely impact IPCs and slow the growth in processor performance
1995-2000
• Figure 1. Clock speed has improved by 50% every year
  - Reduction in logic delays
  - Deeper pipelines
  - This will soon end
• IPC has gone up dramatically (the increased complexity was worth it)
  - Will this end too?
Wire Scaling
• Multiple wire layers – the SIA roadmap predicts dimensions (somewhat aggressive)
• As transistor widths shrink, wires become thinner, and their resistance per unit length goes up (quadratically – Table 1)
• Parallel-plate capacitance reduces, but coupling capacitance increases (slight overall increase)
• The equations are different, but the end result is similar to Palacharla’s (without repeaters)
Wire Scaling
• With repeaters, delay of a fixed-length wire does not go up quadratically as we shrink gate-width
• In going from 250nm to 35nm:
  - 5mm wire delay: 170ps → 390ps
  - Delay to cross X gates: 170ps → 55ps
  - SIA clock speed: 0.75GHz → 13.5GHz
  - Delay to cross X gates: 0.13 cycles → 0.75 cycles
• We could increase wire width, but that compromises bandwidth
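The cycle counts on this slide follow from the picosecond delays and the clock speeds: a delay in cycles is the delay in seconds multiplied by the clock frequency. A minimal Python sketch, using only the numbers quoted on the slide (the function name `delay_in_cycles` is illustrative, not from the paper):

```python
def delay_in_cycles(delay_ps: float, clock_ghz: float) -> float:
    """Convert a delay in picoseconds to clock cycles at the given frequency."""
    # 1 GHz corresponds to 1e-3 cycles per picosecond.
    return delay_ps * clock_ghz * 1e-3


# Delay to cross X gates, using the slide's figures.
gates_250nm = delay_in_cycles(170, 0.75)   # about 0.13 cycles
gates_35nm = delay_in_cycles(55, 13.5)     # about 0.74 cycles

# The same conversion applied to the 5mm wire.
wire_250nm = delay_in_cycles(170, 0.75)    # about 0.13 cycles
wire_35nm = delay_in_cycles(390, 13.5)     # about 5.3 cycles
```

Under these numbers the 5mm wire goes from a small fraction of a cycle to several cycles, even though its absolute delay only slightly more than doubles.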
Clock Scaling
• Logic delay (the FO4 delay) scales linearly with gate length
• Likewise, work per pipeline stage has also been shrinking (Fig. 2)
• The SIA predicts that today’s 16 FO4/stage delay will shrink to 5.6 FO4/stage
• A 64-bit add takes 5.5 FO4 – hence, they examine SIA (super-aggressive), 8-FO4 (aggressive), and 16-FO4 (conservative) scaling strategies
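To get a feel for what these FO4 budgets mean in GHz, a rough sketch follows. It assumes the common rule of thumb that one FO4 delay is about 360 ps per micron of drawn gate length; that constant is an assumption and is not stated on the slide. The FO4 count is treated as the full clock period, and the helper names are illustrative.

```python
FO4_PS_PER_MICRON = 360.0  # assumed rule of thumb, not taken from the slide


def fo4_delay_ps(gate_length_nm: float) -> float:
    """Approximate delay of one FO4 inverter, in picoseconds."""
    return FO4_PS_PER_MICRON * gate_length_nm / 1000.0


def clock_ghz(gate_length_nm: float, fo4_per_cycle: float) -> float:
    """Clock frequency if one cycle is fo4_per_cycle FO4 delays long."""
    period_ps = fo4_per_cycle * fo4_delay_ps(gate_length_nm)
    return 1000.0 / period_ps


# The three scaling strategies from the slide, evaluated at 35nm.
for label, depth in [("SIA", 5.6), ("8-FO4", 8), ("16-FO4", 16)]:
    print(f"{label:>7} at 35nm: {clock_ghz(35, depth):.1f} GHz")
```

Because the estimate is linear in gate length, halving the FO4 count per stage doubles the clock at a fixed technology node; the absolute GHz values depend entirely on the assumed constant.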
Clock Scaling
• While the 15-20% improvement in technology scaling will continue, the 15-20% improvement in pipeline depth will cease
On-Chip Wire Delays
• The number of bits reachable in a cycle is shrinking (by more than a factor of two across three generations)
  - Structures that fit in a cycle today will have to be shrunk (smaller regfiles, issue queues)
• Chip area is steadily increasing
  - Less than 1% of the chip is reachable in a cycle; 30 cycles to go across the chip!
  - Processors are becoming communication-bound
Processor Structure Delays
• To model the microarchitecture, they estimate the delays of all wire-limited structures

| Structure | f_SIA | f_8 | f_16 |
|---|---|---|---|
| 64K-2-port L1 | 7 | 5 | 3 |
| 64-entry 10-port regfile | 3 | 2 | 1 |
| 20-entry 8-port issueq | 3 | 2 | 1 |
| 64-entry 8-port ROB | 3 | 2 | 1 |

• Weakness: bypass delays are not considered
Microarchitecture Scaling
• Capacity Scaling: constant access latencies in cycles (simpler designs), scale capacities down to make the structures fit
• Pipeline Scaling: constant capacities, latencies go up, hence, deeper pipelines
• Any other approaches?
• Replicated Capacity Scaling: fast cores with few resources each, but lots of such cores – high IPC if you can localize communication
IPC Comparisons
[Figure: block diagrams comparing Pipeline Scaling, Capacity Scaling, and Replicated Capacity Scaling. Diagram labels: blocks marked "20-IQ, 40 Regs" with four "F" units; blocks marked "15-IQ, 30 Regs" with three "F" units; annotation "2-cycle wakeup, 2-cycle regread, 2-cycle bypass".]
Results
• Tables on Pg. 10
• Every instruction experiences longer latencies
• IPCs are much lower for aggressive clocks
• Overall performance is still comparable for all approaches
Results
• Over 17 years, the projected speedup is only 7-fold (at the historical rate it would have been 1720-fold), an annual increase of 12.5%
• Slow growth because pipeline depth and IPC increase will stagnate
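The annual rates behind these totals can be checked with a compound-growth calculation. The sketch below uses only the 17-year span and the two speedup figures from the slide; the function name `annual_rate` is illustrative.

```python
def annual_rate(total_speedup: float, years: float) -> float:
    """Constant yearly growth rate that compounds to total_speedup over years."""
    return total_speedup ** (1.0 / years) - 1.0


projected = annual_rate(7, 17)       # roughly 0.12, i.e. about 12% per year
historical = annual_rate(1720, 17)   # roughly 0.55, i.e. about 55% per year

# Going the other way: 12.5% per year compounded over 17 years.
total_at_12_5 = 1.125 ** 17          # a bit over 7-fold
```

A 1720-fold gain over 17 years corresponds to roughly 55% per year, so the projected rate is a drop from about 55% to about 12% annual growth.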
Questionable Assumptions
• Additional transistors are not being used to improve IPC
• All instructions pay wire-delay penalties
Conclusions
• Large monolithic cores will perform poorly – microarchitectures will have to be partitioned
• On-chip caches will be the biggest bottlenecks – 3-cycle 0.5KB L1s, 30-50-cycle 2MB L2s
• Future proposals should be wire-delay-sensitive
Next Class’ Paper
• “Dynamic Code Partitioning for Clustered Architectures”, UPC-Barcelona, 2001
• Instruction steering heuristics to balance load and minimize communication