By Dan Stafford - Rochester Institute of...
By Dan Stafford
Background
◦ Heterogeneous Architectures
Performance Modeling
◦ Single Core Performance Profiling
◦ Multicore Performance Estimation
Test Cases
◦ Multicore Design Space
Results & Observations
◦ General
◦ Limited Off-Chip Bandwidth
◦ Impact of LLC Size
◦ Optimal Heterogeneous Design
◦ Job Scheduling
Summary of Design Considerations
References
Typically used to obtain a higher performance for a lower power budget
CPU/GPU Heterogeneous Systems
◦ Intel Core Series, AMD Fusion, NVIDIA Tegra
Single ISA Heterogeneous Systems
◦ Energy optimized cores
◦ Performance optimized cores
◦ Every core shares a common ISA
◦ ARM big.LITTLE, NVIDIA Kal-El
Clock Rate Heterogeneous Systems
◦ Homogeneous cores
◦ Different clock rates
Single Core CPI
Memory CPI
◦ Fraction of single core CPI spent waiting for memory
Stack Distance Counters (SDCs)
◦ Capture the program's temporal memory access behavior in the Last Level Cache (LLC)
All metrics captured every 20M instructions
SPEC CPU2006 workloads
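To make the idea of stack distance counters concrete, here is a minimal sketch in Python. It is a simplification under stated assumptions, not the profiler used in the study: it models a single fully associative LRU stack rather than per-set counters, and the function names, the toy trace, and the `max_distance` cap are all hypothetical.

```python
from collections import Counter


def stack_distance_histogram(trace, max_distance):
    """Count accesses per LRU stack distance for a trace of cache-line IDs.

    The stack distance of an access is the number of distinct lines touched
    since the previous access to the same line. Cold (first-time) accesses
    and distances beyond max_distance share the bucket `max_distance`.
    """
    stack = []        # most recently used line sits at index 0
    hist = Counter()
    for line in trace:
        if line in stack:
            distance = stack.index(line)
            stack.remove(line)
        else:
            distance = max_distance   # cold access: treated as "too far"
        hist[min(distance, max_distance)] += 1
        stack.insert(0, line)         # the accessed line becomes MRU
    return hist


def misses_for_capacity(hist, capacity):
    """Accesses that miss in a fully associative LRU cache of `capacity` lines.

    Only meaningful for capacity <= the max_distance used for the histogram.
    """
    return sum(count for distance, count in hist.items() if distance >= capacity)


# Toy trace (assumption): letters stand for cache-line addresses.
hist = stack_distance_histogram(list("ABCABDA"), max_distance=4)
print(dict(hist))
print(misses_for_capacity(hist, capacity=3))
```

The useful property is that one histogram answers "how many misses would a cache of size X see?" for every X up to the cap, which is what lets a single profiling run cover several LLC sizes.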
Profiles each core type across all SPEC CPU2006 workloads
Single-core profiles then used to estimate multicore performance
◦ Traditional methods take 80-plus days
◦ Profile-based estimation takes only a single day
◦ Accurate to within 2.1%
Out-of-Order cores
◦ 4-wide: 128-entry reorder buffer
◦ 2-wide: 32-entry reorder buffer
In-Order cores
◦ 4-wide, 2-wide, and scalar
Caches
◦ LRU policy
◦ Private L1 instruction and data caches: 32 KB, 8-way set associative
◦ Private L2 cache: 256 KB, 8-way set associative
◦ Shared L3 cache (LLC): 1-4 MB, 16-way set associative
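As a quick sanity check on the cache parameters above, the number of sets follows from size, associativity, and line size. The sketch below assumes a 64-byte cache line, which the slides do not state; `num_sets` is a hypothetical helper name.

```python
KB = 1024
MB = 1024 * KB


def num_sets(size_bytes, ways, line_bytes=64):
    """Sets in a set-associative cache: size / (ways * line size).

    line_bytes=64 is an assumption; the slides do not give the line size.
    """
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("cache size is not a multiple of ways * line size")
    return sets


print(num_sets(32 * KB, ways=8))    # L1: 64 sets
print(num_sets(256 * KB, ways=8))   # L2: 512 sets
print(num_sets(1 * MB, ways=16))    # smallest LLC: 1024 sets
print(num_sets(4 * MB, ways=16))    # largest LLC: 4096 sets
```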
BCE – Base Core Equivalent
◦ Relative chip area measurement
Heterogeneous designs configured to use 40 BCEs

| Component | #BCEs |
|---|---|
| Scalar in-order core | 1 |
| 2-wide in-order core | 2 |
| 4-wide in-order core | 3 |
| 2-wide out-of-order core | 4 |
| 4-wide out-of-order core | 8 |
| 512 KB LLC slice | 1 |
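The BCE costs above make it easy to check whether a candidate design fits the area budget. The following is a rough sketch, assuming the LLC slices count toward the same 40-BCE budget as the cores (the slides do not say so explicitly); the dictionary keys and function names are hypothetical.

```python
# Area cost per component in BCEs, taken from the table above.
BCE_COST = {
    "scalar_io": 1,   # scalar in-order core
    "2w_io": 2,       # 2-wide in-order core
    "4w_io": 3,       # 4-wide in-order core
    "2w_ooo": 4,      # 2-wide out-of-order core
    "4w_ooo": 8,      # 4-wide out-of-order core
}
LLC_SLICE_KB = 512    # one LLC slice costs 1 BCE


def area_bces(core_counts, llc_kb):
    """Total area in BCEs of a design given core counts and LLC size in KB."""
    if llc_kb % LLC_SLICE_KB:
        raise ValueError("LLC size must be a multiple of 512 KB")
    core_area = sum(BCE_COST[kind] * count for kind, count in core_counts.items())
    return core_area + llc_kb // LLC_SLICE_KB


def fits_budget(core_counts, llc_kb, budget=40):
    """True if the design fits the budget (assumes the LLC counts toward it)."""
    return area_bces(core_counts, llc_kb) <= budget


# Two designs with a 4 MB LLC (8 BCEs) that both use exactly 40 BCEs.
print(area_bces({"4w_ooo": 4}, llc_kb=4096))               # 32 + 8
print(area_bces({"4w_ooo": 2, "2w_io": 8}, llc_kb=4096))   # 16 + 16 + 8
print(fits_budget({"4w_ooo": 5}, llc_kb=4096))             # 48 BCEs: too large
```

Under this accounting, shrinking the LLC from 4 MB to 1 MB frees 6 BCEs, enough for six scalar cores or three 2-wide in-order cores.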
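System Throughput (STP)
◦ Multicore performance from the system perspective
◦ $\mathrm{STP} = \sum_{i=1}^{n} \frac{\mathrm{CPI}_{\mathrm{single},i}}{\mathrm{CPI}_{\mathrm{multi},i}}$
Average Normalized Turnaround Time (ANTT)
◦ User-perceived performance
◦ $\mathrm{ANTT} = \frac{1}{n} \sum_{i=1}^{n} \frac{\mathrm{CPI}_{\mathrm{multi},i}}{\mathrm{CPI}_{\mathrm{single},i}}$
Note: n is the number of independent jobs (programs); $\mathrm{CPI}_{\mathrm{single},i}$ and $\mathrm{CPI}_{\mathrm{multi},i}$ are the CPI of program i when run alone and within the multi-program workload (standard definitions of both metrics)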
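Both metrics can be computed directly from per-program CPI values. The sketch below uses the standard definitions (STP sums each program's single-run CPI divided by its multi-program CPI; ANTT averages the inverse ratio), which are assumed to match the formulas on the slide. The function names and CPI numbers are made up for illustration.

```python
def system_throughput(cpi_single, cpi_multi):
    """STP: sum over programs of (CPI alone / CPI in the mix). Higher is better."""
    if len(cpi_single) != len(cpi_multi):
        raise ValueError("need one single-run and one multi-run CPI per program")
    return sum(s / m for s, m in zip(cpi_single, cpi_multi))


def avg_normalized_turnaround_time(cpi_single, cpi_multi):
    """ANTT: mean over programs of (CPI in the mix / CPI alone). Lower is better."""
    if len(cpi_single) != len(cpi_multi):
        raise ValueError("need one single-run and one multi-run CPI per program")
    n = len(cpi_single)
    return sum(m / s for s, m in zip(cpi_single, cpi_multi)) / n


# Hypothetical two-program workload (values are illustrative only).
cpi_alone = [1.0, 2.0]
cpi_mixed = [2.0, 2.5]
print(system_throughput(cpi_alone, cpi_mixed))                # about 1.3
print(avg_normalized_turnaround_time(cpi_alone, cpi_mixed))   # 1.625
```

With no interference between programs, STP equals n and ANTT equals 1, so the two numbers show how far a design falls from that ideal from the system's and the user's point of view.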
[Figure from [1]]
Simple in-order cores have better system throughput
Aggressive out-of-order cores have better turnaround time
[Figure from [1]]
[Figure from [1]]
Same tradeoff between system throughput and turnaround time
Some heterogeneous configurations outperform homogeneous configurations
Heterogeneity allows more precise control over the system throughput and turnaround time
Two different core types provide the majority of the benefit from heterogeneity
[Figure from [1]]
[Figure from [1]]
Limiting the off-chip bandwidth will proportionally affect the per-program performance more
Best performance achieved using heterogeneous configurations
[Figure from [1]]
Cache reduces the off-chip bandwidth pressure
Under unlimited bandwidth
◦ Less cache leaves room to integrate more cores
◦ Assuming the same chip area as the design with cache
[Figure from [1]]
High throughput: single-issue and dual-issue in-order cores
Per-program performance: At least one out-of-order core
Optimal Mapping
◦ Jobs mapped to cores so that performance is optimized
◦ Requires prior profiling of all configurations
◦ Not feasible
Cache-miss-rate
◦ Jobs with higher LLC miss rates mapped to lower-end cores
Relative Slowdown
◦ Assumes relative performance of each job is known
◦ Job with highest slowdown on smaller core assigned to higher performing core
Random
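The three practical heuristics above can be sketched in a few lines. This is an illustrative reading of the slide, not the scheduler evaluated in [1]: it assumes one job per core, `n_big` high-performance cores, and per-job miss rates and slowdowns that are already known. All names and numbers are hypothetical.

```python
import random


def schedule_by_miss_rate(miss_rate, n_big):
    """Cache-miss-rate heuristic: jobs with the highest LLC miss rate go to
    the lower-end cores, so the lowest-miss-rate jobs get the big cores."""
    ordered = sorted(miss_rate, key=miss_rate.get)   # ascending miss rate
    return ordered[:n_big], ordered[n_big:]


def schedule_by_relative_slowdown(slowdown, n_big):
    """Relative-slowdown heuristic: jobs that slow down most on a small core
    (time on small core / time on big core) get the big cores."""
    ordered = sorted(slowdown, key=slowdown.get, reverse=True)
    return ordered[:n_big], ordered[n_big:]


def schedule_random(jobs, n_big, seed=0):
    """Random baseline: shuffle the jobs and give the first n_big a big core."""
    names = list(jobs)
    random.Random(seed).shuffle(names)
    return names[:n_big], names[n_big:]


# Made-up inputs for four jobs on 2 big + 2 small cores.
miss_rate = {"job_a": 0.30, "job_b": 0.05, "job_c": 0.20, "job_d": 0.02}
slowdown = {"job_a": 1.2, "job_b": 2.5, "job_c": 1.1, "job_d": 1.9}

print(schedule_by_miss_rate(miss_rate, n_big=2))
print(schedule_by_relative_slowdown(slowdown, n_big=2))
print(schedule_random(miss_rate, n_big=2))
```

Each function returns a pair `(jobs_on_big_cores, jobs_on_small_cores)`. The two informed heuristics only differ in the ranking key, which is why their quality depends entirely on how well that key predicts the benefit of a big core.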
Two Core Types
◦ 4-wide out-of-order
◦ 2-wide in-order
6 separate heterogeneous configurations
◦ 4-wide out-of-order cores
◦ 2-wide in-order cores
500 randomly chosen multi-program workload mixes
[Figure from [1]]
None of the scheduling techniques are quantitatively better
◦ Cache-miss rate does not take into account memory parallelism
◦ Relative slowdown requires a substantial amount of information
Active area of research for all types of heterogeneous architecture
Perform many simulations before committing to a specific architecture
Large LLC vs. Additional Cores
◦ Increase LLC size if bandwidth constrained
◦ Additional cores otherwise
System Throughput vs. Per-Program Performance
◦ In-order cores have better system throughput
◦ Out-of-order cores have better per-program performance
[1] K. Van Craeynest and L. Eeckhout, "Understanding fundamental design choices in single-ISA heterogeneous multicore architectures", TACO, vol. 9, no. 4, pp. 1-23, 2013.
[2] R. Kumar, D. Tullsen, P. Ranganathan, N. Jouppi and K. Farkas, "Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance", ACM SIGARCH Computer Architecture News, vol. 32, no. 2, p. 64, 2004.