Computer Architecture:
Interconnects (Part I)
Prof. Onur Mutlu
Carnegie Mellon University
Memory Interference and QoS Lectures
These slides are from a lecture delivered at Bogazici University (June 10, 2013)
Videos:
https://www.youtube.com/playlist?list=PLVngZ7BemHHV6N0ejHhwOfLwTr8Q-UKXj
https://www.youtube.com/watch?v=rKRFlpavGs8&list=PLVngZ7BemHHV6N0ejHhwOfLwTr8Q-UKXj&index=3
https://www.youtube.com/watch?v=go34f7v9KIs&list=PLVngZ7BemHHV6N0ejHhwOfLwTr8Q-UKXj&index=4
https://www.youtube.com/watch?v=nWDoUqoZY4s&list=PLVngZ7BemHHV6N0ejHhwOfLwTr8Q-UKXj&index=5
Multi-Core Architectures and
Shared Resource Management
Lecture 3: Interconnects
Prof. Onur Mutlu
http://www.ece.cmu.edu/~omutlu
Bogazici University
June 10, 2013
Last Lecture
Wrap up Asymmetric Multi-Core Systems
Handling Private Data Locality
Asymmetry Everywhere
Resource Sharing vs. Partitioning
Cache Design for Multi-core Architectures
MLP-aware Cache Replacement
The Evicted-Address Filter Cache
Base-Delta-Immediate Compression
Linearly Compressed Pages
Utility Based Cache Partitioning
Fair Shared Cache Partitioning
Page Coloring Based Cache Partitioning
Agenda for Today
Interconnect design for multi-core systems
(Prefetcher design for multi-core systems)
(Data Parallelism and GPUs)
Readings for Lecture June 6 (Lecture 1.1)
Required – Symmetric and Asymmetric Multi-Core Systems
Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors,” HPCA 2003, IEEE Micro 2003.
Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures,” ASPLOS 2009, IEEE Micro 2010.
Suleman et al., “Data Marshaling for Multi-Core Architectures,” ISCA 2010, IEEE Micro 2011.
Joao et al., “Bottleneck Identification and Scheduling for Multithreaded Applications,” ASPLOS 2012.
Joao et al., “Utility-Based Acceleration of Multithreaded Applications on Asymmetric CMPs,” ISCA 2013.
Recommended
Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” AFIPS 1967.
Olukotun et al., “The Case for a Single-Chip Multiprocessor,” ASPLOS 1996.
Mutlu et al., “Techniques for Efficient Processing in Runahead Execution Engines,” ISCA 2005, IEEE Micro 2006.
Videos for Lecture June 6 (Lecture 1.1)
Runahead Execution: http://www.youtube.com/watch?v=z8YpjqXQJIA&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=28
Multiprocessors Basics: http://www.youtube.com/watch?v=7ozCK_Mgxfk&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=31
Correctness and Coherence: http://www.youtube.com/watch?v=U-VZKMgItDM&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=32
Heterogeneous Multi-Core: http://www.youtube.com/watch?v=r6r2NJxj3kI&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=34
Readings for Lecture June 7 (Lecture 1.2)
Required – Caches in Multi-Core
Qureshi et al., “A Case for MLP-Aware Cache Replacement,” ISCA 2005.
Seshadri et al., “The Evicted-Address Filter: A Unified Mechanism to Address both Cache Pollution and Thrashing,” PACT 2012.
Pekhimenko et al., “Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches,” PACT 2012.
Pekhimenko et al., “Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency,” SAFARI Technical Report 2013.
Recommended
Qureshi et al., “Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches,” MICRO 2006.
Videos for Lecture June 7 (Lecture 1.2)
Cache basics:
http://www.youtube.com/watch?v=TpMdBrM1hVc&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=23
Advanced caches:
http://www.youtube.com/watch?v=TboaFbjTd-E&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=24
Readings for Lecture June 10 (Lecture 1.3)
Required – Interconnects in Multi-Core
Moscibroda and Mutlu, “A Case for Bufferless Routing in On-Chip Networks,” ISCA 2009.
Fallin et al., “CHIPPER: A Low-Complexity Bufferless Deflection Router,” HPCA 2011.
Fallin et al., “MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect,” NOCS 2012.
Das et al., “Application-Aware Prioritization Mechanisms for On-Chip Networks,” MICRO 2009.
Das et al., “Aergia: Exploiting Packet Latency Slack in On-Chip Networks,” ISCA 2010, IEEE Micro 2011.
Recommended
Grot et al., “Preemptive Virtual Clock: A Flexible, Efficient, and Cost-effective QOS Scheme for Networks-on-Chip,” MICRO 2009.
Grot et al., “Kilo-NOC: A Heterogeneous Network-on-Chip Architecture for Scalability and Service Guarantees,” ISCA 2011, IEEE Micro 2012.
More Readings for Lecture 1.3
Studies of congestion and congestion control in on-chip vs. internet-like networks
George Nychis, Chris Fallin, Thomas Moscibroda, Onur Mutlu, and Srinivasan Seshan, "On-Chip Networks from a Networking Perspective: Congestion and Scalability in Many-core Interconnects," Proceedings of the 2012 ACM SIGCOMM Conference (SIGCOMM), Helsinki, Finland, August 2012.
George Nychis, Chris Fallin, Thomas Moscibroda, and Onur Mutlu, "Next Generation On-Chip Networks: What Kind of Congestion Control Do We Need?" Proceedings of the 9th ACM Workshop on Hot Topics in Networks (HOTNETS), Monterey, CA, October 2010.
Videos for Lecture June 10 (Lecture 1.3)
Interconnects
http://www.youtube.com/watch?v=6xEpbFVgnf8&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=33
GPUs and SIMD processing
Vector/array processing basics: http://www.youtube.com/watch?v=f-XL4BNRoBA&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=15
GPUs versus other execution models: http://www.youtube.com/watch?v=dl5TZ4-oao0&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=19
GPUs in more detail: http://www.youtube.com/watch?v=vr5hbSkb1Eg&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=20
Readings for Prefetching
Prefetching
Srinath et al., “Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers,” HPCA 2007.
Ebrahimi et al., “Coordinated Control of Multiple Prefetchers in Multi-Core Systems,” MICRO 2009.
Ebrahimi et al., “Techniques for Bandwidth-Efficient Prefetching of Linked Data Structures in Hybrid Prefetching Systems,” HPCA 2009.
Ebrahimi et al., “Prefetch-Aware Shared Resource Management for Multi-Core Systems,” ISCA 2011.
Lee et al., “Prefetch-Aware DRAM Controllers,” MICRO 2008.
Recommended
Lee et al., “Improving Memory Bank-Level Parallelism in the Presence of Prefetching,” MICRO 2009.
Videos for Prefetching
Prefetching
http://www.youtube.com/watch?v=IIkIwiNNl0c&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=29
http://www.youtube.com/watch?v=yapQavK6LUk&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=30
Readings for GPUs and SIMD
GPUs and SIMD processing
Narasiman et al., “Improving GPU Performance via Large Warps and Two-Level Warp Scheduling,” MICRO 2011.
Jog et al., “OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance,” ASPLOS 2013.
Jog et al., “Orchestrated Scheduling and Prefetching for GPGPUs,” ISCA 2013.
Lindholm et al., “NVIDIA Tesla: A Unified Graphics and Computing Architecture,” IEEE Micro 2008.
Fung et al., “Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow,” MICRO 2007.
Videos for GPUs and SIMD
GPUs and SIMD processing
Vector/array processing basics: http://www.youtube.com/watch?v=f-XL4BNRoBA&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=15
GPUs versus other execution models: http://www.youtube.com/watch?v=dl5TZ4-oao0&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=19
GPUs in more detail: http://www.youtube.com/watch?v=vr5hbSkb1Eg&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=20
Online Lectures and More Information
Online Computer Architecture Lectures
http://www.youtube.com/playlist?list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ
Online Computer Architecture Courses
Intro: http://www.ece.cmu.edu/~ece447/s13/doku.php
Advanced: http://www.ece.cmu.edu/~ece740/f11/doku.php
Advanced: http://www.ece.cmu.edu/~ece742/doku.php
Recent Research Papers
http://users.ece.cmu.edu/~omutlu/projects.htm
http://scholar.google.com/citations?user=7XyGUGkAAAAJ&hl=en
Interconnect Basics
Interconnect in a Multi-Core System
[Figure: multi-core system; labeled component: Shared Storage]
Where Is Interconnect Used?
To connect components
Many examples
Processors and processors
Processors and memories (banks)
Processors and caches (banks)
Caches and caches
I/O devices
[Figure: interconnection network]
Why Is It Important?
Affects the scalability of the system
How large of a system can you build?
How easily can you add more processors?
Affects performance and energy efficiency
How fast can processors, caches, and memory communicate?
How long are the latencies to memory?
How much energy is spent on communication?
Interconnection Network Basics
Topology
Specifies the way switches are wired
Affects routing, reliability, throughput, latency, building ease
Routing (algorithm)
How does a message get from source to destination?
Static or adaptive
Buffering and Flow Control
What do we store within the network?
Entire packets, parts of packets, etc?
How do we throttle during oversubscription?
Tightly coupled with routing strategy
Topology
Bus (simplest)
Point-to-point connections (ideal and most costly)
Crossbar (less costly)
Ring
Tree
Omega
Hypercube
Mesh
Torus
Butterfly
…
Metrics to Evaluate Interconnect Topology
Cost
Latency (in hops, in nanoseconds)
Contention
Many others exist that you should think about
Energy
Bandwidth
Overall system performance
Bus
+ Simple
+ Cost effective for a small number of nodes
+ Easy to implement coherence (snooping and serialization)
- Not scalable to a large number of nodes (limited bandwidth, electrical loading → reduced frequency)
- High contention → fast saturation
[Figure: four processors with caches and four memory modules sharing a bus]
Point-to-Point
Every node connected to every other
+ Lowest contention
+ Potentially lowest latency
+ Ideal, if cost is not an issue
-- Highest cost
O(N) connections/ports per node
O(N^2) links
-- Not scalable
-- How to lay out on chip?
[Figure: eight nodes, 0-7, each connected to every other]
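The O(N) ports per node and O(N^2) links of a point-to-point network can be made concrete with a small count. This is a minimal sketch that assumes one bidirectional link per node pair; the function name `point_to_point_cost` is illustrative and not from the lecture.

```python
def point_to_point_cost(n: int) -> dict:
    """Port and link counts for a fully connected network of n nodes.

    Assumes one bidirectional link between every pair of nodes.
    """
    if n < 1:
        raise ValueError("n must be positive")
    return {
        "ports_per_node": n - 1,    # one port to every other node: O(N)
        "links": n * (n - 1) // 2,  # one link per unordered node pair: O(N^2)
    }


if __name__ == "__main__":
    for n in (4, 8, 16, 64):
        print(n, point_to_point_cost(n))
```

Doubling the node count roughly quadruples the number of links, which is why the topology stops being practical beyond a handful of nodes.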
Crossbar
Every node connected to every other (non-blocking), except that only one node can use a given connection at any given time
Enables concurrent sends to non-conflicting destinations
Good for small number of nodes
+ Low latency and high throughput
- Expensive
- Not scalable: O(N^2) cost
- Difficult to arbitrate as N increases
Used in core-to-cache-bank networks in
- IBM POWER5
- Sun Niagara I/II
[Figure: 8x8 crossbar; inputs 0-7, outputs 0-7]
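A crossbar lets several sources send at once as long as their destinations differ, so arbitration reduces to picking one winner per destination. The sketch below assumes a fixed-priority policy (lowest-numbered source wins); that policy and the name `arbitrate` are illustrative assumptions, not the arbitration scheme of any real chip.

```python
def arbitrate(requests: dict) -> dict:
    """Grant at most one source per destination in one crossbar cycle.

    requests maps source id -> requested destination id.
    Assumption: fixed priority, the lowest-numbered source wins a conflict.
    Returns the granted subset as source id -> destination id.
    """
    granted = {}
    busy_destinations = set()
    for src in sorted(requests):
        dst = requests[src]
        if dst not in busy_destinations:
            busy_destinations.add(dst)
            granted[src] = dst
    return granted


if __name__ == "__main__":
    # Sources 0 and 1 both want destination 3; source 2 wants destination 5.
    print(arbitrate({0: 3, 1: 3, 2: 5}))
```

Sources 0 and 2 proceed concurrently because their destinations do not conflict, while source 1 has to retry in a later cycle.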
Another Crossbar Design
[Figure: alternative 8x8 crossbar design; inputs 0-7, outputs 0-7]
Sun UltraSPARC T2 Core-to-Cache Crossbar
High bandwidth interface between 8 cores and 8 L2 banks & NCU
4-stage pipeline: req, arbitration, selection, transmission
2-deep queue for each src/dest pair to hold data transfer request
Buffered Crossbar
[Figure: 4-port buffered crossbar (network interfaces, flow control, one output arbiter per output) next to a 4-port bufferless crossbar]
+ Simpler arbitration/scheduling
+ Efficient support for variable-size packets
- Requires N^2 buffers
Can We Get Lower Cost than A Crossbar?
Yet still have low contention?
Idea: Multistage networks
Multistage Logarithmic Networks
Idea: Indirect networks with multiple layers of switches between terminals/nodes
Cost: O(NlogN), Latency: O(logN)
Many variations (Omega, Butterfly, Benes, Banyan, …)
Omega Network:
[Figure: 8x8 Omega network with inputs and outputs labeled 000 to 111; two routes that need the same link are marked as a conflict]
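One common way to route through an Omega network is destination-tag routing: before each stage the lines are permuted by a perfect shuffle (a left rotation of the line address), and the 2x2 switch then selects its upper or lower output according to the next destination bit, most significant bit first. The sketch below assumes that scheme; `omega_path` and `conflicts` are hypothetical names, and the conflict test simply checks whether two routes need the same switch output at the same stage.

```python
def omega_path(src: int, dst: int, k: int) -> list:
    """Line occupied after each of the k stages of a 2^k x 2^k Omega network.

    Assumes destination-tag routing with a perfect shuffle before every stage.
    """
    n = 1 << k
    if not (0 <= src < n and 0 <= dst < n):
        raise ValueError("src and dst must be in [0, 2^k)")
    pos = src
    path = []
    for stage in range(k):
        # Perfect shuffle: rotate the k-bit line address left by one bit.
        pos = ((pos << 1) | (pos >> (k - 1))) & (n - 1)
        # The switch output is chosen by the next destination bit (MSB first).
        bit = (dst >> (k - 1 - stage)) & 1
        pos = (pos & ~1) | bit
        path.append(pos)
    return path


def conflicts(path_a: list, path_b: list) -> bool:
    """True if two routes use the same switch output at the same stage."""
    return any(a == b for a, b in zip(path_a, path_b))


if __name__ == "__main__":
    a = omega_path(0b000, 0b000, 3)
    b = omega_path(0b100, 0b001, 3)
    c = omega_path(0b001, 0b111, 3)
    print(a, b, conflicts(a, b))
    print(a, c, conflicts(a, c))
```

Because there is a single path per source-destination pair, two transfers with different sources and different destinations can still block each other, which is the cost of using fewer switches than a crossbar.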
Multistage Circuit Switched
More restrictions on feasible concurrent Tx-Rx pairs
But more scalable than crossbar in cost, e.g., O(N logN) for Butterfly
[Figure: 8x8 multistage network (inputs 0-7, outputs 0-7) built from 2-by-2 crossbars]
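The cost gap between a crossbar and a multistage network can be checked with a quick count. This sketch assumes N is a power of two, log2(N) stages of N/2 two-by-two switches, and four crosspoints per 2-by-2 switch; the function names are illustrative.

```python
def crossbar_crosspoints(n: int) -> int:
    """Crosspoints in an n x n crossbar: O(N^2)."""
    return n * n


def multistage_cost(n: int) -> dict:
    """Switch and crosspoint counts for a log2(n)-stage network of 2x2 switches.

    Assumes n is a power of two and each 2x2 switch holds 4 crosspoints.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 2")
    stages = n.bit_length() - 1   # log2(n)
    switches = stages * (n // 2)  # N/2 switches per stage: O(N log N)
    return {"stages": stages, "switches": switches, "crosspoints": 4 * switches}


if __name__ == "__main__":
    for n in (8, 64, 1024):
        print(n, crossbar_crosspoints(n), multistage_cost(n))
```

At 8 nodes the saving is modest, but it widens quickly as N grows, at the price of the extra restrictions on concurrent transfers.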
Multistage Packet Switched
Packets “hop” from router to router, pending availability of the next-required switch and buffer
[Figure: 8x8 multistage network (inputs 0-7, outputs 0-7) built from 2-by-2 routers]
Aside: Circuit vs. Packet Switching
Circuit switching sets up full path
Establish route then send data
(no one else can use those links)
+ faster arbitration
-- setting up and bringing down links takes time
Packet switching routes per packet
Route each packet individually (possibly via different paths)
if link is free, any packet can use it
-- potentially slower --- must dynamically switch
+ no setup or tear-down time
+ more flexible, does not underutilize links
Switching vs. Topology
Circuit/packet switching choice independent of topology
It is a higher-level protocol on how a message gets sent to a destination
However, some topologies are more amenable to circuit vs. packet switching
Another Example: Delta Network
Single path from source to destination
Does not support all possible permutations
Proposed to replace costly crossbars as processor-memory interconnect
Janak H. Patel, “Processor-Memory Interconnections for Multiprocessors,” ISCA 1979.
[Figure: 8x8 Delta network]
Another Example: Omega Network
Single path from source to destination
All stages are the same
Used in NYU Ultracomputer
Gottlieb et al., “The NYU Ultracomputer-designing a MIMD, shared-memory parallel machine,” ISCA 1982.
Ring
+ Cheap: O(N) cost
- High latency: O(N)
- Not easy to scale
- Bisection bandwidth remains constant
Used in Intel Haswell, Intel Larrabee, IBM Cell, many commercial systems today
[Figure: ring of nodes, each with a processor (P), memory (M), and switch (S)]
Unidirectional Ring
Simple topology and implementation
Reasonable performance if N and performance needs (bandwidth & latency) are still moderately low
O(N) cost
N/2 average hops; latency depends on utilization
[Figure: unidirectional ring of 2x2 routers R0, R1, ..., R(N-2), R(N-1)]
Bidirectional Rings
+ Reduces latency
+ Improves scalability
- Slightly more complex injection policy (need to select which ring to inject a packet into)
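The latency benefit of a bidirectional ring, and the injection decision it adds, can be sketched by counting hops. The code assumes nodes numbered 0 to N-1 around the ring and an injection policy that picks the direction with fewer hops (ties go clockwise); both the policy and the function names are assumptions for illustration.

```python
def ring_hops(src: int, dst: int, n: int, bidirectional: bool = False) -> int:
    """Hop count from src to dst on an n-node ring."""
    clockwise = (dst - src) % n
    if not bidirectional:
        return clockwise
    return min(clockwise, n - clockwise)


def pick_ring(src: int, dst: int, n: int) -> str:
    """Assumed injection policy: use the direction with fewer hops."""
    clockwise = (dst - src) % n
    return "clockwise" if clockwise <= n - clockwise else "counterclockwise"


def average_hops(n: int, bidirectional: bool = False) -> float:
    """Average hops to all other nodes (identical for every source by symmetry)."""
    total = sum(ring_hops(0, d, n, bidirectional) for d in range(1, n))
    return total / (n - 1)


if __name__ == "__main__":
    for n in (8, 16, 64):
        print(n, average_hops(n), average_hops(n, bidirectional=True))
    print(pick_ring(0, 6, 8))
```

With uniform traffic the unidirectional average is N/2 hops, and the bidirectional ring brings it down to roughly N/4.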
Hierarchical Rings
+ More scalable
+ Lower latency
- More complex
More on Hierarchical Rings
Chris Fallin, Xiangyao Yu, Kevin Chang, Rachata Ausavarungnirun, Greg Nazario, Reetuparna Das, and Onur Mutlu, "HiRD: A Low-Complexity, Energy-Efficient Hierarchical Ring Interconnect" SAFARI Technical Report, TR-SAFARI-2012-004, Carnegie Mellon University, December 2012.
Discusses the design and implementation of a mostly-bufferless hierarchical ring
Mesh
O(N) cost
Average latency: O(sqrt(N))
Easy to lay out on-chip: regular and equal-length links
Path diversity: many ways to get from one node to another
Used in Tilera 100-core
And many on-chip network prototypes
Torus
Mesh is not symmetric on edges: performance very sensitive to placement of task on edge vs. middle
Torus avoids this problem
+ Higher path diversity (and bisection bandwidth) than mesh
- Higher cost
- Harder to lay out on-chip
- Unequal link lengths
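The edge asymmetry of a mesh, and how the torus wraparound links remove it, shows up in a simple hop count. This sketch assumes a k x k grid (k at least 2), minimal routing, and one hop per link; the function names are illustrative.

```python
from itertools import product


def mesh_hops(a: tuple, b: tuple) -> int:
    """Minimal hop count between two (x, y) nodes in a 2D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def torus_hops(a: tuple, b: tuple, k: int) -> int:
    """Minimal hop count in a k x k torus, using wraparound links when shorter."""
    dx, dy = abs(a[0] - b[0]), abs(a[1] - b[1])
    return min(dx, k - dx) + min(dy, k - dy)


def average_hops_from(node: tuple, k: int, torus: bool = False) -> float:
    """Average hops from one node to every other node of a k x k network."""
    others = [p for p in product(range(k), repeat=2) if p != node]
    total = sum(torus_hops(node, p, k) if torus else mesh_hops(node, p)
                for p in others)
    return total / len(others)


if __name__ == "__main__":
    k = 4
    for node in ((0, 0), (1, 1)):
        print(node,
              "mesh:", average_hops_from(node, k),
              "torus:", average_hops_from(node, k, torus=True))
```

In the mesh, a corner node sees a noticeably higher average distance than a middle node, so task placement matters; in the torus every node sees the same average.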
Torus, continued
Weave nodes to make inter-node latencies ~constant
[Figure: woven torus layout; each node has a switch (S), memory (M), and processor (P)]
Trees
Planar, hierarchical topology
Latency: O(logN)
Good for local traffic
+ Cheap: O(N) cost
+ Easy to lay out
- Root can become a bottleneck
Fat trees avoid this problem (CM-5)
[Figure: Fat Tree]
CM-5 Fat Tree
Fat tree based on 4x2 switches
Randomized routing on the way up
Combining, multicast, reduction operators supported in hardware
Thinking Machines Corp., “The Connection Machine CM-5 Technical Summary,” Jan. 1992.
Hypercube
Latency: O(logN)
Radix: O(logN)
#links: O(NlogN)
+ Low latency
- Hard to lay out in 2D/3D
[Figure: 4-dimensional hypercube; 16 nodes labeled with 4-bit addresses 0000 to 1111]
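In a hypercube, two nodes are neighbors when their binary addresses differ in exactly one bit, so both the radix and the worst-case hop count equal the number of address bits, log2(N). The sketch below illustrates this and adds dimension-order (often called e-cube) routing, which the slide does not name and is included only as a common, assumed way to pick a path.

```python
def neighbors(node: int, dim: int) -> list:
    """Neighbors of a node in a dim-dimensional hypercube: flip one bit each."""
    return [node ^ (1 << i) for i in range(dim)]


def hops(a: int, b: int) -> int:
    """Minimal hop count is the Hamming distance between the two addresses."""
    return bin(a ^ b).count("1")


def ecube_route(src: int, dst: int, dim: int) -> list:
    """Dimension-order route: fix differing bits from lowest to highest.

    Assumption: this routing scheme is illustrative, not taken from the slide.
    """
    path = [src]
    cur = src
    for i in range(dim):
        if (cur ^ dst) & (1 << i):
            cur ^= 1 << i
            path.append(cur)
    return path


if __name__ == "__main__":
    print([format(x, "04b") for x in neighbors(0b0000, 4)])
    print(hops(0b0101, 0b1010))
    print([format(x, "04b") for x in ecube_route(0b0000, 0b1010, 4)])
```

Each node needs log2(N) ports, which is what makes the topology low-latency but hard to lay out in two or three dimensions as N grows.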
Caltech Cosmic Cube
64-node message passing machine
Seitz, “The Cosmic Cube,” CACM 1985.
Handling Contention
Two packets trying to use the same link at the same time
What do you do?
Buffer one
Drop one
Misroute one (deflection)
Tradeoffs?
Bufferless Deflection Routing
Key idea: Packets are never buffered in the network. When two packets contend for the same link, one is deflected. [1]
New traffic can be injected whenever there is a free output link.
[Figure: packets routed toward a node labeled Destination]
[1] Baran, “On Distributed Communication Networks,” RAND Tech. Report, 1962 / IEEE Trans. Comm., 1964.
Bufferless Deflection Routing
Input buffers are eliminated: flits are buffered in pipeline latches and on network links
[Figure: router with North, South, East, West, and Local ports on both the input and the output side; deflection routing logic takes the place of the input buffers]
![Page 54: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/54.jpg)
Routing Algorithm
Types
Deterministic: always chooses the same path for a communicating source-destination pair
Oblivious: chooses different paths, without considering network state
Adaptive: can choose different paths, adapting to the state of the network
How to adapt
Local/global feedback
Minimal or non-minimal paths
![Page 55: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/55.jpg)
Deterministic Routing
All packets between the same (source, dest) pair take the same path
Dimension-order routing
E.g., XY routing (used in Cray T3D, and many on-chip networks)
First traverse dimension X, then traverse dimension Y
+ Simple
+ Deadlock freedom (no cycles in resource allocation)
- Could lead to high contention
- Does not exploit path diversity
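A minimal sketch of XY dimension-order routing on a 2D mesh follows. It assumes nodes are addressed by integer (x, y) coordinates; the function name `xy_route` is hypothetical and only illustrates the "X first, then Y" rule described above.

```python
def xy_route(src, dst):
    """Return the list of nodes visited from src to dst.

    The packet first moves along X until the column matches,
    then along Y until the row matches.
    """
    x, y = src
    dx, dy = dst
    path = [(x, y)]
    while x != dx:                # traverse dimension X first
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                # then traverse dimension Y
        y += 1 if dy > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
    print(xy_route((2, 1), (0, 0)))
```

Because the path depends only on the (source, destination) pair, every packet between the same pair follows the same links, which is why the scheme is simple but cannot route around a congested link.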
![Page 56: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/56.jpg)
Deadlock
No forward progress
Caused by circular dependencies on resources
Each packet waits for a buffer occupied by another packet downstream
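The circular dependency can be pictured as a "waits-for" relation among packets. The sketch below assumes a simplified model in which each blocked packet waits on exactly one other packet; the dictionary layout and the name `find_deadlock` are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Optional


def find_deadlock(waits_for: Dict[str, str]) -> Optional[List[str]]:
    """Return one cycle of mutually waiting packets, or None.

    waits_for[p] = q means packet p needs a buffer currently held by q.
    """
    for start in waits_for:
        seen = []
        cur = start
        # Follow the chain until it ends or revisits a packet.
        while cur in waits_for and cur not in seen:
            seen.append(cur)
            cur = waits_for[cur]
        if cur in seen:           # chain closed on itself: a cycle
            return seen[seen.index(cur):]
    return None


if __name__ == "__main__":
    print(find_deadlock({"A": "B", "B": "C", "C": "D", "D": "A"}))
    print(find_deadlock({"A": "B", "B": "C"}))
```

In the first call every packet waits on the next one around a ring, so none can make forward progress; in the second the chain ends at a packet that is free to move.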
![Page 57: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/57.jpg)
Handling Deadlock
Avoid cycles in routing
Dimension order routing
Cannot build a circular dependency
Restrict the “turns” each packet can take
Avoid deadlock by adding more buffering (escape paths)
Detect and break deadlock
Preemption of buffers
![Page 58: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/58.jpg)
Turn Model to Avoid Deadlock
Idea
Analyze directions in which packets can turn in the network
Determine the cycles that such turns can form
Prohibit just enough turns to break possible cycles
Glass and Ni, “The Turn Model for Adaptive Routing,” ISCA 1992.
![Page 59: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/59.jpg)
Oblivious Routing: Valiant’s Algorithm
An example of an oblivious algorithm
Goal: Balance network load
Idea: Randomly choose an intermediate destination, route to it first, then route from there to destination
Between source-intermediate and intermediate-dest, can use dimension order routing
+ Randomizes/balances network load
- Non-minimal (packet latency can increase)
Optimizations:
Do this only under high load
Restrict the intermediate node to be close (in the same quadrant)
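The two-phase idea can be sketched as follows, assuming a 2D mesh with (x, y) node coordinates and dimension-order routing on each leg. The names `xy_route` and `valiant_route` and the explicit `rng` parameter are hypothetical choices for this illustration, not part of the lecture.

```python
import random


def xy_route(src, dst):
    """Dimension-order route: X first, then Y."""
    x, y = src
    dx, dy = dst
    path = [(x, y)]
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path


def valiant_route(src, dst, width, height, rng):
    """Route via a uniformly random intermediate node of the mesh."""
    mid = (rng.randrange(width), rng.randrange(height))
    first_leg = xy_route(src, mid)
    second_leg = xy_route(mid, dst)
    # Drop the duplicated intermediate node when joining the legs.
    return mid, first_leg + second_leg[1:]


if __name__ == "__main__":
    rng = random.Random(0)        # seeded only to make runs repeatable
    mid, path = valiant_route((0, 0), (3, 3), 4, 4, rng)
    print("intermediate:", mid)
    print("hops:", len(path) - 1, "(minimal would be 6)")
```

The path never consults network state, which is what makes the scheme oblivious; the price is that the hop count can exceed the minimal distance.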
![Page 60: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/60.jpg)
Adaptive Routing
Minimal adaptive
Router uses network state (e.g., downstream buffer occupancy) to pick which “productive” output port to send a packet to
Productive output port: port that gets the packet closer to its destination
+ Aware of local congestion
- Minimality restricts achievable link utilization (load balance)
Non-minimal (fully) adaptive
“Misroute” packets to non-productive output ports based on network state
+ Can achieve better network utilization and load balance
- Need to guarantee livelock freedom
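As a rough sketch of the minimal adaptive case, the router below restricts itself to productive ports and picks the one whose downstream buffer is least occupied. The port names, the convention that y grows toward North, and the `occupancy` dictionary are assumptions made for this example.

```python
def productive_ports(cur, dst):
    """Ports that move the packet closer to dst on a 2D mesh."""
    x, y = cur
    dx, dy = dst
    ports = []
    if dx > x:
        ports.append("E")
    if dx < x:
        ports.append("W")
    if dy > y:
        ports.append("N")     # assumption: y increases toward North
    if dy < y:
        ports.append("S")
    return ports


def pick_port(cur, dst, occupancy):
    """Minimal adaptive choice: least-occupied productive port."""
    ports = productive_ports(cur, dst)
    if not ports:
        return "LOCAL"            # already at the destination
    return min(ports, key=lambda p: occupancy[p])


if __name__ == "__main__":
    occ = {"N": 4, "E": 1, "S": 0, "W": 0}
    print(pick_port((1, 1), (3, 3), occ))
```

Here South and West are emptier, but they are not productive, so the minimal policy never uses them. A non-minimal policy would be allowed to, at the cost of needing a livelock-freedom argument.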
![Page 61: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/61.jpg)
On-Chip Networks
[Figure: grid of nodes, each pairing a router (R) with a processing element (PE)]
R: Router
PE: Processing Element (cores, L2 banks, memory controllers, etc.)
• Connect cores, caches, memory controllers, etc
– Buses and crossbars are not scalable
• Packet switched
• 2D mesh: Most commonly used topology
• Primarily serve cache misses and memory requests
![Page 62: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/62.jpg)
© Onur Mutlu, 2009, 2010
On-chip Networks
[Figure: mesh of router (R) / processing element (PE) nodes, with one router expanded to show its internals]
Input ports with buffers: From East, From West, From North, From South, From PE; each port holds virtual channels (VC 0, VC 1, VC 2) selected by a VC identifier
Control logic: Routing Unit (RC), VC Allocator (VA), Switch Allocator (SA)
Crossbar (5 x 5) outputs: To East, To PE, To West, To North, To South
R: Router
PE: Processing Element (cores, L2 banks, memory controllers, etc.)
![Page 63: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/63.jpg)
On-Chip vs. Off-Chip Interconnects
On-chip advantages
Low latency between cores
No pin constraints
Rich wiring resources
Very high bandwidth
Simpler coordination
On-chip constraints/disadvantages
2D substrate limits implementable topologies
Energy/power consumption a key concern
Complex algorithms undesirable
Logic area constrains use of wiring resources
![Page 64: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/64.jpg)
On-Chip vs. Off-Chip Interconnects (II)
Cost
Off-chip: Channels, pins, connectors, cables
On-chip: Cost is storage and switches (wires are plentiful)
Leads to networks with many wide channels, few buffers
Channel characteristics
On-chip: short distances, low latency
On-chip RC lines need repeaters every 1-2 mm
Can put logic in repeaters
Workloads
Multi-core cache traffic vs. supercomputer interconnect traffic
![Page 65: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/65.jpg)
Motivation for Efficient Interconnect
In many-core chips, on-chip interconnect (NoC) consumes significant power
Intel Terascale: ~28% of chip power
Intel SCC: ~10%
MIT RAW: ~36%
Recent work [1] uses bufferless deflection routing to reduce power and die area
[Figure: chip tile labeled Core, L1, L2 Slice, and Router]
[1] Moscibroda and Mutlu, “A Case for Bufferless Routing in On-Chip Networks,” ISCA 2009.
![Page 66: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/66.jpg)
Research in Interconnects
![Page 67: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/67.jpg)
Research Topics in Interconnects
Plenty of topics in interconnection networks. Examples:
Energy/power efficient and proportional design
Reducing Complexity: Simplified router and protocol designs
Adaptivity: Ability to adapt to different access patterns
QoS and performance isolation
Reducing and controlling interference, admission control
Co-design of NoCs with other shared resources
End-to-end performance, QoS, power/energy optimization
Scalable topologies to many cores, heterogeneous systems
Fault tolerance
Request prioritization, priority inversion, coherence, …
New technologies (optical, 3D)
![Page 68: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/68.jpg)
One Example: Packet Scheduling
Which packet to choose for a given output port?
Router needs to prioritize between competing flits
Which input port?
Which virtual channel?
Which application’s packet?
Common strategies
Round robin across virtual channels
Oldest packet first (or an approximation)
Prioritize some virtual channels over others
Better policies in a multi-core environment
Use application characteristics
Minimize energy
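Two of the common strategies listed above can be sketched in a few lines. The class `RoundRobinArbiter`, the function `oldest_first`, and the `inject_cycle` field are hypothetical names for this illustration; real routers implement these policies in hardware, typically only approximating oldest-first.

```python
class RoundRobinArbiter:
    """Grant one requester per call, rotating priority after each grant."""

    def __init__(self, n):
        self.n = n
        self.next = 0             # index with highest priority next time

    def grant(self, requests):
        """requests[i] is True if virtual channel i wants the output port."""
        for i in range(self.n):
            idx = (self.next + i) % self.n
            if requests[idx]:
                self.next = (idx + 1) % self.n
                return idx
        return None               # nobody requested


def oldest_first(flits):
    """Pick the flit with the smallest injection timestamp."""
    if not flits:
        return None
    return min(flits, key=lambda f: f["inject_cycle"])


if __name__ == "__main__":
    arb = RoundRobinArbiter(3)
    print([arb.grant([True, False, True]) for _ in range(4)])
    print(oldest_first([{"id": "a", "inject_cycle": 7},
                        {"id": "b", "inject_cycle": 3}]))
```

Round robin is fair across virtual channels but blind to packet age or application; oldest-first favors packets that have waited longest, at the cost of comparing timestamps.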
![Page 69: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/69.jpg)
![Page 70: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/70.jpg)
Bufferless Routing
Thomas Moscibroda and Onur Mutlu, "A Case for Bufferless Routing in On-Chip Networks"
Proceedings of the 36th International Symposium on Computer Architecture (ISCA), pages 196-207, Austin, TX, June 2009.
![Page 71: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/71.jpg)
On-Chip Networks (NoC)
• Connect cores, caches, memory controllers, etc.
• Examples:
– Intel 80-core Terascale chip
– MIT RAW chip
• Design goals in NoC design:
– High throughput, low latency
– Fairness between cores, QoS, …
– Low complexity, low cost
– Power, low energy consumption
Energy/Power in On-Chip Networks
• Power is a key constraint in the design of high-performance processors
• NoCs consume a substantial portion of system power
– ~30% in Intel 80-core Terascale [IEEE Micro’07]
– ~40% in MIT RAW Chip [ISCA’04]
– NoCs estimated to consume 100s of Watts [Borkar, DAC’07]
![Page 72: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/72.jpg)
Current NoC Approaches
• Existing approaches differ in numerous ways:
– Network topology [Kim et al, ISCA’07, Kim et al, ISCA’08 etc]
– Flow control [Michelogiannakis et al, HPCA’09, Kumar et al, MICRO’08, etc]
– Virtual Channels [Nicopoulos et al, MICRO’06, etc]
– QoS & fairness mechanisms [Lee et al, ISCA’08, etc]
– Routing algorithms [Singh et al, CAL’04]
– Router architecture [Park et al, ISCA’08]
– Broadcast, Multicast [Jerger et al, ISCA’08, Rodrigo et al, MICRO’08]
Existing work assumes existence of buffers in routers!
![Page 73: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/73.jpg)
A Typical Router
[Figure: input channels 1 to N feed input ports 1 to N, each holding virtual channels VC1, VC2, …, VCv; blocks labeled Routing Computation, VC Arbiter, Switch Arbiter, and Scheduler control an N x N crossbar that drives output channels 1 to N; credit flow returns to the upstream router]
Buffers are an integral part of existing NoC routers
![Page 74: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/74.jpg)
Buffers in NoC Routers
• Buffers are necessary for high network throughput
– Buffers increase total available bandwidth in network
[Plot: average packet latency vs. injection rate, with curves for small, medium, and large buffers]
![Page 75: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/75.jpg)
Buffers in NoC Routers
• Buffers are necessary for high network throughput
– Buffers increase total available bandwidth in network
• Buffers consume significant energy/power
– Dynamic energy when read/write
– Static energy even when not occupied
• Buffers add complexity and latency
– Logic for buffer management
– Virtual channel allocation
– Credit-based flow control
• Buffers require significant chip area
– E.g., in TRIPS prototype chip, input buffers occupy 75% of total on-chip network area [Gratz et al, ICCD’06]
![Page 76: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/76.jpg)
Going Bufferless…?
• How much throughput do we lose? How is latency affected?
• Up to what injection rates can we use bufferless routing? Are there realistic scenarios in which NoC is operated at injection rates below the threshold?
• Can we achieve energy reduction? If so, how much…?
• Can we reduce area, complexity, etc…?
[Plot: latency vs. injection rate, with one curve for buffers and one for no buffers]
![Page 77: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/77.jpg)
BLESS: Bufferless Routing
• Always forward all incoming flits to some output port
• If no productive direction is available, send to another direction
• Packet is deflected
Hot-potato routing [Baran’64, etc]
[Figure: the same contention shown in a buffered network and in BLESS; in BLESS one packet is deflected]
![Page 78: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/78.jpg)
BLESS: Bufferless Routing
[Figure: router control blocks labeled Routing, VC Arbiter, Switch Arbiter, Flit-Ranking, and Port-Prioritization; Flit-Ranking and Port-Prioritization form the arbitration policy]
Flit-Ranking: 1. Create a ranking over all incoming flits
Port-Prioritization: 2. For a given flit in this ranking, find the best free output port
Apply to each flit in order of ranking
![Page 79: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/79.jpg)
FLIT-BLESS: Flit-Level Routing
• Each flit is routed independently.
• Oldest-first arbitration (other policies evaluated in paper)
Flit-Ranking: 1. Oldest-first ranking
Port-Prioritization: 2. Assign flit to productive port, if possible. Otherwise, assign to non-productive port.
• Network Topology: Can be applied to most topologies (Mesh, Torus, Hypercube, Trees, …)
1) #output ports ≥ #input ports at every router
2) every router is reachable from every other router
• Flow Control & Injection Policy: Completely local, inject whenever input port is free
• Absence of Deadlocks: every flit is always moving
• Absence of Livelocks: with oldest-first ranking
![Page 80: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/80.jpg)
BLESS with Buffers
• BLESS without buffers is extreme end of a continuum
• BLESS can be integrated with buffers
– FLIT-BLESS with Buffers
– WORM-BLESS with Buffers
• Whenever a buffer is full, its first flit becomes must-schedule
• must-schedule flits must be deflected if necessary
See paper for details…
![Page 81: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/81.jpg)
BLESS: Advantages & Disadvantages
Advantages
• No buffers
• Purely local flow control
• Simplicity: no credit-flows, no virtual channels, simplified router design
• No deadlocks, livelocks
• Adaptivity: packets are deflected around congested areas!
• Router latency reduction
• Area savings
Disadvantages
• Increased latency
• Reduced bandwidth
• Increased buffering at receiver
• Header information at each flit
• Oldest-first arbitration complex
• QoS becomes difficult
Impact on energy…?
![Page 82: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/82.jpg)
Reduction of Router Latency
• BLESS gets rid of input buffers and virtual channels
[Figure: router pipeline diagrams for a head flit and a body flit crossing Router 1 and Router 2]
Pipeline stages:
BW: Buffer Write
RC: Route Computation
VA: Virtual Channel Allocation
SA: Switch Allocation
ST: Switch Traversal
LT: Link Traversal
LA LT: Link Traversal of Lookahead
Baseline Router (speculative): head flit uses BW, RC, VA, SA, ST, LT; body flit uses BW, SA, ST, LT. Router Latency = 3. Can be improved to 2. [Dally, Towles’04]
BLESS Router (standard): RC, ST, LT. Router Latency = 2
BLESS Router (optimized): lookahead link traversal (LA LT). Router Latency = 1
![Page 83: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/83.jpg)
BLESS: Advantages & Disadvantages
Advantages
• No buffers
• Purely local flow control
• Simplicity: no credit-flows, no virtual channels, simplified router design
• No deadlocks, livelocks
• Adaptivity: packets are deflected around congested areas!
• Router latency reduction
• Area savings
Disadvantages
• Increased latency
• Reduced bandwidth
• Increased buffering at receiver
• Header information at each flit
Impact on energy…?
Extensive evaluations in the paper!
![Page 84: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/84.jpg)
Evaluation Methodology
• 2D mesh network, router latency is 2 cycles
o 4x4, 8 core, 8 L2 cache banks (each node is a core or an L2 bank)
o 4x4, 16 core, 16 L2 cache banks (each node is a core and an L2 bank)
o 8x8, 16 core, 64 L2 cache banks (each node is L2 bank and may be a core)
o 128-bit wide links, 4-flit data packets, 1-flit address packets
o For baseline configuration: 4 VCs per physical input port, 1 packet deep
• Benchmarks
o Multiprogrammed SPEC CPU2006 and Windows Desktop applications
o Heterogeneous and homogeneous application mixes
o Synthetic traffic patterns: UR, Transpose, Tornado, Bit Complement
• x86 processor model based on Intel Pentium M
o 2 GHz processor, 128-entry instruction window
o 64 KB private L1 caches
o Total 16 MB shared L2 caches; 16 MSHRs per bank
o DRAM model based on Micron DDR2-800
![Page 85: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/85.jpg)
Evaluation Methodology
• Energy model provided by Orion simulator [MICRO’02]
o 70nm technology, 2 GHz routers at 1.0 Vdd
• For BLESS, we model
o Additional energy to transmit header information
o Additional buffers needed on the receiver side
o Additional logic to reorder flits of individual packets at receiver
• We partition network energy into buffer energy, router energy, and link energy, each having static and dynamic components.
• Comparisons against non-adaptive and aggressive adaptive buffered routing algorithms (DO, MIN-AD, ROMM)
![Page 86: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/86.jpg)
Evaluation – Synthetic Traces
• First, the bad news
• Uniform random injection
• BLESS has significantly lower saturation throughput compared to buffered baseline.
[Plot: average latency (0 to 100) vs. injection rate (flits per cycle per node, 0 to 0.49); curves for FLIT-2, WORM-2, FLIT-1, WORM-1, and MIN-AD, with the best BLESS variant and the baseline marked]
![Page 87: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/87.jpg)
Evaluation – Homogeneous Case Study
• milc benchmarks (moderately intensive)
• Perfect caches!
• Very little performance degradation with BLESS (less than 4% in dense network)
• With router latency 1, BLESS can even outperform baseline (by ~10%)
• Significant energy improvements (almost 40%)
[Plots: W-Speedup and normalized energy (BufferEnergy, LinkEnergy, RouterEnergy) for Baseline, BLESS, and RL=1 on 4x4 with 8x milc, 4x4 with 16x milc, and 8x8 with 16x milc]
![Page 88: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/88.jpg)
Evaluation – Homogeneous Case Study
Observations:
1) Injection rates not extremely high on average: self-throttling!
2) For bursts and temporary hotspots, use network links as buffers!
![Page 89: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/89.jpg)
Evaluation – Further Results
• BLESS increases buffer requirement at receiver by at most 2x; overall, energy is still reduced
• Impact of memory latency: with real caches, very little slowdown! (at most 1.5%)
See paper for details…
[Plot: W-Speedup of DO, MIN-AD, ROMM, FLIT-2, WORM-2, FLIT-1, and WORM-1 on 4x4 with 8x matlab, 4x4 with 16x matlab, and 8x8 with 16x matlab]
![Page 90: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/90.jpg)
Evaluation – Further Results
• BLESS increases buffer requirement at receiver by at most 2x; overall, energy is still reduced
• Impact of memory latency: with real caches, very little slowdown! (at most 1.5%)
• Heterogeneous application mixes (we evaluate several mixes of intensive and non-intensive applications)
– little performance degradation
– significant energy savings in all cases
– no significant increase in unfairness across different applications
• Area savings: ~60% of network area can be saved!
See paper for details…
![Page 91: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/91.jpg)
Evaluation – Aggregate Results
• Aggregate results over all 29 applications

| Sparse Network | Perfect L2, Average | Perfect L2, Worst-Case | Realistic L2, Average | Realistic L2, Worst-Case |
|---|---|---|---|---|
| ∆ Network Energy | -39.4% | -28.1% | -46.4% | -41.0% |
| ∆ System Performance | -0.5% | -3.2% | -0.15% | -0.55% |

[Plots: normalized energy (BufferEnergy, LinkEnergy, RouterEnergy) and W-Speedup for FLIT, WORM, and BASE, mean and worst case]
![Page 92: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/92.jpg)
Evaluation – Aggregate Results
• Aggregate results over all 29 applications

| Sparse Network | Perfect L2, Average | Perfect L2, Worst-Case | Realistic L2, Average | Realistic L2, Worst-Case |
|---|---|---|---|---|
| ∆ Network Energy | -39.4% | -28.1% | -46.4% | -41.0% |
| ∆ System Performance | -0.5% | -3.2% | -0.15% | -0.55% |

| Dense Network | Perfect L2, Average | Perfect L2, Worst-Case | Realistic L2, Average | Realistic L2, Worst-Case |
|---|---|---|---|---|
| ∆ Network Energy | -32.8% | -14.0% | -42.5% | -33.7% |
| ∆ System Performance | -3.6% | -17.1% | -0.7% | -1.5% |
![Page 93: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/93.jpg)
BLESS Conclusions
• For a very wide range of applications and network settings, buffers are not needed in NoC
– Significant energy savings (32% even in dense networks and perfect caches)
– Area savings of 60%
– Simplified router and network design (flow control, etc…)
– Performance slowdown is minimal (can even increase!)
A strong case for a rethinking of NoC design!
• Future research:
– Support for quality of service, different traffic classes, energy-management, etc…
![Page 94: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/94.jpg)
CHIPPER: A Low-complexity Bufferless Deflection Router
Chris Fallin, Chris Craik, and Onur Mutlu, "CHIPPER: A Low-Complexity Bufferless Deflection Router"
Proceedings of the 17th International Symposium on High-Performance Computer Architecture (HPCA), pages 144-155, San Antonio, TX, February 2011.
![Page 95: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/95.jpg)
Motivation
Recent work has proposed bufferless deflection routing (BLESS [Moscibroda, ISCA 2009])
Energy savings: ~40% in total NoC energy
Area reduction: ~40% in total NoC area
Minimal performance loss: ~4% on average
Unfortunately: unaddressed complexities in router
long critical path, large reassembly buffers
Goal: obtain these benefits while simplifying the router
in order to make bufferless NoCs practical.
![Page 96: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/96.jpg)
Problems that Bufferless Routers Must Solve
1. Must provide livelock freedom
A packet should not be deflected forever
2. Must reassemble packets upon arrival
Flit: atomic routing unit
Packet: one or multiple flits
[Figure: a packet made up of flits 0, 1, 2, 3]
![Page 97: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/97.jpg)
A Bufferless Router: A High-Level View
[Figure: router containing deflection routing logic and a crossbar; the local node injects flits into the router and receives ejected flits through reassembly buffers]
Problem 1: Livelock Freedom
Problem 2: Packet Reassembly
![Page 98: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/98.jpg)
Complexity in Bufferless Deflection Routers
1. Must provide livelock freedom
Flits are sorted by age, then assigned in age order to output ports
43% longer critical path than buffered router
2. Must reassemble packets upon arrival
Reassembly buffers must be sized for worst case
4KB per node
(8x8, 64-byte cache block)
![Page 99: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/99.jpg)
Problem 1: Livelock Freedom
[Figure: the high-level router view. Labels: Inject, Eject, Deflection Routing Logic, Crossbar, Reassembly Buffers.]
![Page 100: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/100.jpg)
Livelock Freedom in Previous Work
What stops a flit from deflecting forever?
All flits are timestamped
Oldest flits are assigned their desired ports
Total order among flits
But what is the cost of this?
Flit age forms a total order: guaranteed progress!
New traffic is lowest priority
![Page 101: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/101.jpg)
Age-Based Priorities are Expensive: Sorting
Router must sort flits by age: long-latency sort network
Three comparator stages for 4 flits
[Figure: sort network ordering four flits (1 to 4) by age.]
![Page 102: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/102.jpg)
Age-Based Priorities Are Expensive: Allocation
After sorting, flits assigned to output ports in priority order
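Port assignment of younger flits depends on that of older flits: sequential dependence in the port allocator
Example with age-ordered flits (oldest first):
Flit 1 requests East (all ports free): GRANT, Flit 1 goes East
Flit 2 requests East (free ports {N,S,W}): DEFLECT, Flit 2 goes North
Flit 3 requests South (free ports {S,W}): GRANT, Flit 3 goes South
Flit 4 requests South (free ports {W}): DEFLECT, Flit 4 goes West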
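The sequential dependence above can be sketched in a few lines of Python. This is a minimal illustration, not the actual router logic: the function name, the tuple layout, and the choice of "first remaining free port" for a deflected flit are assumptions made for the example.

```python
def allocate_oldest_first(flits):
    """flits: list of (flit_id, age, desired_port) tuples.
    Returns {flit_id: (assigned_port, was_deflected)}."""
    free = ["N", "E", "S", "W"]
    result = {}
    # Sort step: oldest flit first (a larger age means an older flit).
    for flit_id, _age, want in sorted(flits, key=lambda f: -f[1]):
        # Allocation step: each decision depends on what older flits took.
        if want in free:
            port, deflected = want, False
        else:
            port, deflected = free[0], True  # assumed: take any remaining port
        free.remove(port)
        result[flit_id] = (port, deflected)
    return result


# The four-flit example from the slide: two flits want East, two want South.
grants = allocate_oldest_first([(1, 40, "E"), (2, 30, "E"), (3, 20, "S"), (4, 10, "S")])
```

The loop body cannot start for flit k until flits 1..k-1 have been placed, which is the serialization that lengthens the critical path in hardware.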
![Page 103: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/103.jpg)
Age-Based Priorities Are Expensive
Overall, deflection routing logic based on Oldest-First has a 43% longer critical path than a buffered router
Question: is there a cheaper way to route while guaranteeing livelock-freedom?
[Figure: the Priority Sort and Port Allocator stages of the deflection routing logic.]
![Page 104: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/104.jpg)
Solution: Golden Packet for Livelock Freedom
What is really necessary for livelock freedom?
Key Insight: No total order is needed. It is enough to:
1. Pick one flit to prioritize until arrival
2. Ensure any flit is eventually picked
[Figure: two orderings compared. Flit age forms a total order: guaranteed progress, new traffic is lowest-priority. With a “Golden Flit”, only one flit is ranked above the rest: guaranteed progress, so a partial ordering is sufficient!]
![Page 105: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/105.jpg)
What Does Golden Flit Routing Require?
Only need to properly route the Golden Flit
First Insight: no need for full sort
Second Insight: no need for sequential allocation
![Page 106: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/106.jpg)
Golden Flit Routing With Two Inputs
Let’s route the Golden Flit in a two-input router first
Step 1: pick a “winning” flit: Golden Flit, else random
Step 2: steer the winning flit to its desired output and deflect the other flit
Golden Flit always routes toward destination
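A minimal sketch of the two-input case, assuming a flit is a small dict with a `golden` flag and a `want` field (0 for the upper output, 1 for the lower). These names are hypothetical and chosen only for illustration.

```python
import random


def arbiter_block(flit_a, flit_b, rng=random):
    """Two-input, two-output arbiter. Each flit is None or a dict with
    'golden' (bool) and 'want' (0 = upper output, 1 = lower output).
    Returns (upper_out, lower_out)."""
    if flit_a is None and flit_b is None:
        return (None, None)
    # Step 1: pick a winner: the golden flit if present, else random.
    if flit_b is None or (flit_a is not None and flit_a["golden"]):
        winner, loser = flit_a, flit_b
    elif flit_a is None or flit_b["golden"]:
        winner, loser = flit_b, flit_a
    else:
        winner, loser = (flit_a, flit_b) if rng.random() < 0.5 else (flit_b, flit_a)
    # Step 2: steer the winner to its desired output; the loser takes the other.
    outs = [None, None]
    outs[winner["want"]] = winner
    outs[1 - winner["want"]] = loser
    return tuple(outs)
```

No sorting and no dependence on other blocks is involved: the decision uses only the two flits at hand.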
106
![Page 107: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/107.jpg)
Golden Flit Routing with Four Inputs
107
Each block makes decisions independently!
Deflection is a distributed decision
[Figure: permutation network built from two-input arbiter blocks; inputs N, E, S, W; outputs N, S, E, W.]
![Page 108: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/108.jpg)
Permutation Network Operation
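[Figure: worked example on the permutation network (inputs N, E, S, W; outputs N, S, E, W), replacing the Priority Sort and Port Allocator stages. Callouts at the arbiter blocks: “wins, swap!”, “wins, no swap!” (twice), “Golden: wins, swap!”; one flit is marked “deflected”.]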
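The four-input network can be sketched as two stages of independent two-input blocks. The wiring below (inputs N and E share a first-stage block, S and W the other; one second-stage block drives outputs N and S, the other E and W) is an assumption read off the port ordering in the figure, and the sketch assumes at most one golden flit is present at a router.

```python
import random

# Assumed wiring: which second-stage block owns each output, and the slot within it.
STAGE2 = {"N": 0, "S": 0, "E": 1, "W": 1}
SLOT = {"N": 0, "S": 1, "E": 0, "W": 1}


def block(a, b, want, rng):
    """Two-input arbiter: golden flit wins, else a random flit wins.
    The winner goes to output want(winner); the loser gets the other output."""
    present = [f for f in (a, b) if f is not None]
    if not present:
        return [None, None]
    golden = [f for f in present if f["golden"]]
    winner = golden[0] if golden else rng.choice(present)
    loser = next((f for f in present if f is not winner), None)
    out = [None, None]
    out[want(winner)] = winner
    out[1 - want(winner)] = loser
    return out


def permute(inputs, rng=random):
    """inputs: dict input port -> flit or None; flits are dicts with
    'golden' (bool) and 'dest' (desired output port).
    Returns dict output port -> flit or None."""
    # Stage 1: steer each winner toward the second-stage block that owns its port.
    top = block(inputs["N"], inputs["E"], lambda f: STAGE2[f["dest"]], rng)
    bot = block(inputs["S"], inputs["W"], lambda f: STAGE2[f["dest"]], rng)
    # Stage 2: steer each winner to its exact port within the block.
    ns = block(top[0], bot[0], lambda f: SLOT[f["dest"]], rng)
    ew = block(top[1], bot[1], lambda f: SLOT[f["dest"]], rng)
    return {"N": ns[0], "S": ns[1], "E": ew[0], "W": ew[1]}
```

Because the golden flit wins at every block it passes through, it always reaches its desired port; every other flit may end up on any output, which is what deflection means here.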
![Page 109: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/109.jpg)
Problem 2: Packet Reassembly
[Figure: the router view with the Inject/Eject stage and the Reassembly Buffers.]
![Page 110: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/110.jpg)
Reassembly Buffers are Large
Worst case: every node sends a packet to one receiver
[Figure: N sending nodes (Node 0, Node 1, …, Node N-1), one packet in flight per node, all targeting one receiver: O(N) space!]
Why can’t we make reassembly buffers smaller?
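As a quick check of the O(N) sizing, the arithmetic below reproduces the 4KB-per-node figure quoted earlier for an 8x8 mesh. It assumes one packet carries one 64-byte cache block and, as on the slide, counts N senders with one packet in flight each; the function name is hypothetical.

```python
def worst_case_reassembly_bytes(mesh_width, mesh_height, packet_bytes):
    """Reassembly space needed if every node has one packet in flight
    to the same receiver (N senders, as counted on the slide)."""
    senders = mesh_width * mesh_height
    return senders * packet_bytes


per_node = worst_case_reassembly_bytes(8, 8, 64)  # 64 senders * 64 bytes = 4096 bytes
```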
![Page 111: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/111.jpg)
Small Reassembly Buffers Cause Deadlock
What happens when reassembly buffer is too small?
[Figure: many senders, one receiver, with the network in between. The receiver cannot eject because its reassembly buffer is full. The remaining flits must inject for forward progress, but senders cannot inject new traffic because the network is full.]
![Page 112: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/112.jpg)
Reserve Space to Avoid Deadlock?
What if every sender asks permission from the receiver before it sends?
Drawback: adds additional delay to every request
Sequence between sender and receiver (reassembly buffers): 1. Reserve Slot 2. ACK 3. Send Packet
![Page 113: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/113.jpg)
Escaping Deadlock with Retransmissions
Sender is optimistic instead: assume buffer is free
If not, receiver drops and NACKs; sender retransmits
Advantage: no additional delay in best case
Drawbacks: transmit buffering overhead for all packets; potentially many retransmits
Sequence between sender (retransmit buffers) and receiver (reassembly buffers): 1. Send (2 flits) 2. Drop, NACK 3. Other packet completes 4. Retransmit packet 5. ACK 6. Sender frees data
![Page 114: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/114.jpg)
Solution: Retransmitting Only Once
Key Idea: Retransmit only when space becomes available.
Receiver drops packet if full; notes which packet it drops
When space frees up, receiver reserves space so retransmit is successful
Receiver notifies sender to retransmit
[Figure: sender with retransmit buffers; receiver with reassembly buffers, one slot marked Reserved. The receiver sends a NACK and records “Pending: Node 0 Req 0”.]
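The receiver side of Retransmit-Once can be sketched as a small state machine. This is a packet-level simplification under stated assumptions (real routers track individual flits), and the class and method names are hypothetical.

```python
from collections import deque


class Receiver:
    """Packet-level sketch of a Retransmit-Once receiver."""

    def __init__(self, slots):
        self.free = slots        # unreserved reassembly slots
        self.pending = deque()   # (sender, request) pairs that were dropped
        self.reserved = set()    # packets with a slot held for their retransmit

    def on_packet(self, sender, req):
        key = (sender, req)
        if key in self.reserved:     # retransmit arrives: its slot is already held
            self.reserved.discard(key)
            return "accept"
        if self.free > 0:
            self.free -= 1
            return "accept"
        self.pending.append(key)     # buffer full: drop, but note which packet
        return "drop"

    def on_slot_freed(self):
        """Call when a reassembled packet leaves its slot.
        Returns the (sender, request) to notify for retransmit, or None."""
        if self.pending:
            key = self.pending.popleft()
            self.reserved.add(key)   # the freed slot goes straight to this packet
            return key
        self.free += 1
        return None
```

Because the slot is reserved before the sender is told to retransmit, a dropped packet is sent at most one more time.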
![Page 115: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/115.jpg)
Using MSHRs as Reassembly Buffers
[Figure: the router view in which the Reassembly Buffers at Inject/Eject are provided by the cache Miss Buffers (MSHRs).]
Using miss buffers for reassembly makes this a truly bufferless network.
![Page 116: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/116.jpg)
CHIPPER: Cheap Interconnect Partially-Permuting Router
Baseline Bufferless Deflection Router (Inject, Eject, Deflection Routing Logic, Crossbar, Reassembly Buffers)
Large buffers for worst case: addressed by Retransmit-Once and cache buffers
Long critical path (1. sort by age, 2. allocate ports sequentially): addressed by Golden Packet and the permutation network
![Page 117: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/117.jpg)
CHIPPER: Cheap Interconnect Partially-Permuting Router
[Figure: the CHIPPER router, with Inject/Eject backed by the Miss Buffers (MSHRs).]
![Page 118: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/118.jpg)
EVALUATION
118
![Page 119: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/119.jpg)
Methodology
Multiprogrammed workloads: CPU2006, server, desktop
8x8 (64 cores), 39 homogeneous and 10 mixed sets
Multithreaded workloads: SPLASH-2, 16 threads
4x4 (16 cores), 5 applications
System configuration
Buffered baseline: 2-cycle router, 4 VCs/channel, 8 flits/VC
Bufferless baseline: 2-cycle latency, FLIT-BLESS
Instruction-trace driven, closed-loop, 128-entry OoO window
64KB L1, perfect L2 (stresses interconnect), XOR mapping
119
![Page 120: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/120.jpg)
Methodology
Hardware modeling
Verilog models for CHIPPER, BLESS, buffered logic
Synthesized with commercial 65nm library
ORION for crossbar, buffers and links
Power
Static and dynamic power from hardware models
Based on event counts in cycle-accurate simulations
120
![Page 121: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/121.jpg)
Results: Performance Degradation
[Figure: two bar charts comparing Buffered, BLESS, and CHIPPER. Multithreaded panel: normalized speedup (axis 0 to 1) for luc, cholesky, radix, fft, lun, and AVG. Multiprogrammed panel (subset of 49 total): weighted speedup (axis 0 to 64) for perlbench, tonto, gcc, h264ref, vpr, search.1, MIX.5, MIX.2, MIX.8, MIX.0, MIX.6, GemsFDTD, stream, mcf, and AVG (full set). Callouts: 13.6%, 1.8%, 3.6%, 49.8%.]
Minimal loss for low-to-medium-intensity workloads
![Page 122: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/122.jpg)
Results: Power Reduction
[Figure: two bar charts comparing Buffered, BLESS, and CHIPPER. Multiprogrammed panel (subset of 49 total): network power in W (axis 0 to 18) for perlbench, tonto, gcc, h264ref, vpr, search.1, MIX.5, MIX.2, MIX.8, MIX.0, MIX.6, GemsFDTD, stream, mcf, and AVG (full set). Multithreaded panel (axis 0 to 2.5) for luc, cholesky, radix, fft, lun, and AVG. Callouts: 54.9%, 73.4%.]
Removing buffers provides the majority of power savings
Slight savings from BLESS to CHIPPER
![Page 123: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/123.jpg)
Results: Area and Critical Path Reduction
[Figure: two bar charts for Buffered, BLESS, and CHIPPER (axis 0 to 1.5 in each): Normalized Router Area and Normalized Critical Path. Callouts: -36.2%, -29.1%, +1.1%, -1.6%.]
CHIPPER maintains area savings of BLESS
Critical path becomes competitive with buffered
![Page 124: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/124.jpg)
Conclusions
Two key issues in bufferless deflection routing: livelock freedom and packet reassembly
Bufferless deflection routers were high-complexity and impractical
Oldest-first prioritization leads to a long critical path in the router
No end-to-end flow control for reassembly, so they are prone to deadlock with reasonably-sized reassembly buffers
CHIPPER is a new, practical bufferless deflection router
Golden packet prioritization gives a short critical path in the router
Retransmit-once protocol gives deadlock-free packet reassembly
Cache miss buffers used as reassembly buffers give a truly bufferless network
CHIPPER frequency comparable to buffered routers at much lower area and power cost, and minimal performance loss
![Page 125: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/125.jpg)
MinBD:
Minimally-Buffered Deflection Routing
for Energy-Efficient Interconnect
Chris Fallin, Greg Nazario, Xiangyao Yu, Kevin Chang, Rachata Ausavarungnirun, and Onur Mutlu,
"MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect"
Proceedings of the 6th ACM/IEEE International Symposium on Networks on Chip (NOCS), Lyngby, Denmark, May 2012.
![Page 126: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/126.jpg)
Bufferless Deflection Routing
Key idea: Packets are never buffered in the network. When two packets contend for the same link, one is deflected.
Removing buffers yields significant benefits
Reduces power (CHIPPER: reduces NoC power by 55%)
Reduces die area (CHIPPER: reduces NoC area by 36%)
But, at high network utilization (load), bufferless deflection routing causes unnecessary link & router traversals
Reduces network throughput and application performance
Increases dynamic power
Goal: Improve high-load performance of low-cost deflection networks by reducing the deflection rate.
126
![Page 127: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/127.jpg)
Outline: This Talk
Motivation
Background: Bufferless Deflection Routing
MinBD: Reducing Deflections
Addressing Link Contention
Addressing the Ejection Bottleneck
Improving Deflection Arbitration
Results
Conclusions
127
![Page 129: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/129.jpg)
Issues in Bufferless Deflection Routing
Correctness: Deliver all packets without livelock
CHIPPER¹: Golden Packet
Globally prioritize one packet until delivered
Correctness: Reassemble packets without deadlock
CHIPPER¹: Retransmit-Once
Performance: Avoid performance degradation at high load
MinBD
¹ Fallin et al., “CHIPPER: A Low-complexity Bufferless Deflection Router”, HPCA 2011.
![Page 130: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/130.jpg)
Key Performance Issues
1. Link contention: no buffers to hold traffic, so any link contention causes a deflection
Solution: use side buffers
2. Ejection bottleneck: only one flit can eject per router per cycle, so simultaneous arrival causes deflection
Solution: eject up to 2 flits/cycle
3. Deflection arbitration: practical (fast) deflection arbiters deflect unnecessarily
Solution: new priority scheme (silver flit)
![Page 131: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/131.jpg)
Outline: This Talk
Motivation
Background: Bufferless Deflection Routing
MinBD: Reducing Deflections
Addressing Link Contention
Addressing the Ejection Bottleneck
Improving Deflection Arbitration
Results
Conclusions
131
![Page 133: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/133.jpg)
Addressing Link Contention
Problem 1: Any link contention causes a deflection
Buffering a flit can avoid deflection on contention
But, input buffers are expensive:
All flits are buffered on every hop: high dynamic energy
Large buffers are necessary: high static energy and large area
Key Idea 1: add a small buffer to a bufferless deflection router to buffer only flits that would have been deflected
133
![Page 134: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/134.jpg)
How to Buffer Deflected Flits
[Figure: Baseline Router¹ with Eject and Inject ports; flits labeled Destination, Destination, and DEFLECTED.]
¹ Fallin et al., “CHIPPER: A Low-complexity Bufferless Deflection Router”, HPCA 2011.
![Page 135: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/135.jpg)
How to Buffer Deflected Flits
Side-Buffered Router
Step 1. Remove up to one deflected flit per cycle from the outputs.
Step 2. Buffer this flit in a small FIFO “side buffer.”
Step 3. Re-inject this flit into pipeline when a slot is available.
[Figure: Side-Buffered Router with Eject and Inject ports and a Side Buffer; flits labeled Destination, Destination, and DEFLECTED.]
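The three steps can be sketched as a small FIFO wrapper around the router outputs and inputs. This is an illustrative sketch, not the MinBD hardware: flits are assumed to be dicts carrying a `deflected` flag, the buffer simply takes the first deflected flit it sees, and the class and method names are hypothetical.

```python
from collections import deque


class SideBuffer:
    """Small FIFO that captures flits that would otherwise be deflected."""

    def __init__(self, capacity=4):
        self.fifo = deque()
        self.capacity = capacity

    def capture(self, outputs):
        """Steps 1 and 2: remove at most one deflected flit per cycle from
        the router outputs and store it. outputs: list of flits or None."""
        if len(self.fifo) < self.capacity:
            for i, flit in enumerate(outputs):
                if flit is not None and flit["deflected"]:
                    self.fifo.append(flit)
                    outputs[i] = None
                    break
        return outputs

    def reinject(self, inputs):
        """Step 3: place the oldest buffered flit into the first empty
        input slot, if there is one. inputs: list of flits or None."""
        if self.fifo and None in inputs:
            inputs[inputs.index(None)] = self.fifo.popleft()
        return inputs
```

Only flits that lose arbitration ever touch the buffer, which is why a few entries can stand in for much larger per-input buffers.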
![Page 136: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/136.jpg)
Why Could A Side Buffer Work Well?
Buffer some flits and deflect other flits at per-flit level
Relative to bufferless routers, deflection rate reduces (need not deflect all contending flits)
4-flit buffer reduces deflection rate by 39%
Relative to buffered routers, buffer is more efficiently used (need not buffer all flits)
similar performance with 25% of buffer space
136
![Page 137: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/137.jpg)
Outline: This Talk
Motivation
Background: Bufferless Deflection Routing
MinBD: Reducing Deflections
Addressing Link Contention
Addressing the Ejection Bottleneck
Improving Deflection Arbitration
Results
Conclusions
137
![Page 138: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/138.jpg)
Addressing the Ejection Bottleneck
Problem 2: Flits deflect unnecessarily because only one flit can eject per router per cycle
In 20% of all ejections, ≥ 2 flits could have ejected, so all but one flit must deflect and try again
These deflected flits cause additional contention
Ejection width of 2 flits/cycle reduces deflection rate by 21%
Key idea 2: Reduce deflections due to a single-flit ejection port by allowing two flits to eject per cycle
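A minimal sketch of the effect of ejection width, assuming flits are represented only by their destination node id. The function and parameter names (`eject`, `here`, `width`) are hypothetical.

```python
def eject(arrivals, here, width):
    """Split the flits at a router into (ejected, must_continue).
    arrivals: destination node ids of the flits present this cycle;
    here: this router's node id; width: flits that may eject per cycle."""
    local = [d for d in arrivals if d == here]
    ejected = local[:width]
    # Flits bound elsewhere, plus local flits that found no ejection port,
    # stay in the network; the latter can only be deflected.
    must_continue = [d for d in arrivals if d != here] + local[width:]
    return ejected, must_continue
```

With two flits arriving for node 5 at node 5, `width=1` leaves one of them in the network while `width=2` removes both.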
138
![Page 139: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/139.jpg)
Addressing the Ejection Bottleneck
[Figure: Single-Width Ejection router with Eject and Inject ports; one flit is DEFLECTED.]
![Page 140: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/140.jpg)
Addressing the Ejection Bottleneck
[Figure: Dual-Width Ejection router with Eject and Inject ports.]
For fair comparison, baseline routers have dual-width ejection for performance results (not power/area)
![Page 141: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/141.jpg)
Outline: This Talk
Motivation
Background: Bufferless Deflection Routing
MinBD: Reducing Deflections
Addressing Link Contention
Addressing the Ejection Bottleneck
Improving Deflection Arbitration
Results
Conclusions
141
![Page 142: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/142.jpg)
Improving Deflection Arbitration
Problem 3: Deflections occur unnecessarily because fast arbiters must use simple priority schemes
Age-based priorities (several past works): full priority order gives fewer deflections, but requires slow arbiters
State-of-the-art deflection arbitration (Golden Packet & two-stage permutation network)
Prioritize one packet globally (ensure forward progress)
Arbitrate other flits randomly (fast critical path)
Random common case leads to uncoordinated arbitration
142
![Page 143: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/143.jpg)
Fast Deflection Routing Implementation
Let’s route in a two-input router first:
Step 1: pick a “winning” flit (Golden Packet, else random)
Step 2: steer the winning flit to its desired output and deflect the other flit
Highest-priority flit always routes to destination
143
![Page 144: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/144.jpg)
Fast Deflection Routing with Four Inputs
144
Each block makes decisions independently
Deflection is a distributed decision
[Figure: permutation network built from two-input arbiter blocks; inputs N, E, S, W; outputs N, S, E, W.]
![Page 145: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/145.jpg)
Unnecessary Deflections in Fast Arbiters
How does lack of coordination cause unnecessary deflections?
1. No flit is golden (pseudorandom arbitration)
2. Red flit wins at first stage
3. Green flit loses at first stage (must be deflected now)
4. Red flit loses at second stage; Red and Green are deflected
[Figure: Red and Green flits in the permutation network, each with its destination marked; all flits have equal priority, which leads to an unnecessary deflection!]
![Page 146: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/146.jpg)
Improving Deflection Arbitration
Key idea 3: Add a priority level and prioritize one flit to ensure at least one flit is not deflected in each cycle
Highest priority: one Golden Packet in network
Chosen in static round-robin schedule
Ensures correctness
Next-highest priority: one silver flit per router per cycle
Chosen pseudo-randomly & local to one router
Enhances performance
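The two priority levels can be sketched as follows. The epoch length, the size of the packet-id space, and all names here are hypothetical parameters for illustration; the slides only state that the golden packet follows a static round-robin schedule and that one silver flit is picked pseudo-randomly per router per cycle.

```python
import random

GOLDEN, SILVER, ORDINARY = 2, 1, 0


def golden_packet_id(cycle, epoch_len, num_packet_ids):
    """Static round-robin: every epoch_len cycles the next packet id
    becomes golden. Every router can compute this locally from the cycle."""
    return (cycle // epoch_len) % num_packet_ids


def assign_levels(flit_packet_ids, golden_id, rng=random):
    """flit_packet_ids: packet ids of the flits at one router this cycle.
    Returns {flit index: priority level}."""
    levels = {i: ORDINARY for i in range(len(flit_packet_ids))}
    if flit_packet_ids:
        # One silver flit per router per cycle, chosen pseudo-randomly.
        levels[rng.randrange(len(flit_packet_ids))] = SILVER
    for i, pid in enumerate(flit_packet_ids):
        if pid == golden_id:
            levels[i] = GOLDEN  # golden outranks silver
    return levels
```

The golden level carries the correctness guarantee; the silver level only coordinates the otherwise random arbitration within one router.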
146
![Page 147: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/147.jpg)
Adding A Silver Flit
Randomly picking a silver flit ensures one flit is not deflected
1. No flit is golden but Red flit is silver
2. Red flit wins at first stage (silver)
3. Green flit is deflected at first stage
4. Red flit wins at second stage (silver); not deflected
[Figure: the same example in two versions. When all flits have equal priority, both may be deflected; when the red flit has higher priority (silver), at least one flit is not deflected.]
![Page 148: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/148.jpg)
Minimally-Buffered Deflection Router
148
Eject Inject
Problem 1: Link Contention
Solution 1: Side Buffer
Problem 2: Ejection Bottleneck
Solution 2: Dual-Width Ejection
Problem 3: Unnecessary Deflections
Solution 3: Two-level priority scheme
![Page 149: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/149.jpg)
Outline: This Talk
Motivation
Background: Bufferless Deflection Routing
MinBD: Reducing Deflections
Addressing Link Contention
Addressing the Ejection Bottleneck
Improving Deflection Arbitration
149
![Page 150: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/150.jpg)
Outline: This Talk
Motivation
Background: Bufferless Deflection Routing
MinBD: Reducing Deflections
Addressing Link Contention
Addressing the Ejection Bottleneck
Improving Deflection Arbitration
Results
Conclusions
150
![Page 151: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/151.jpg)
Methodology: Simulated System
Chip Multiprocessor Simulation
64-core and 16-core models
Closed-loop core/cache/NoC cycle-level model
Directory cache coherence protocol (SGI Origin-based)
64KB L1, perfect L2 (stresses interconnect), XOR-mapping
Performance metric: Weighted Speedup (similar conclusions from network-level latency)
Workloads: multiprogrammed SPEC CPU2006
75 randomly-chosen workloads
Binned into network-load categories by average injection rate
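The slides do not spell out the metric, so the sketch below uses the commonly used definition of weighted speedup: the sum, over all applications in a workload, of each application's IPC when sharing the system divided by its IPC when running alone. The function and argument names are assumptions.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Sum of per-application IPC ratios (shared run vs. alone run)."""
    return sum(shared / alone for shared, alone in zip(ipc_shared, ipc_alone))


# Two applications, each slowed to half its stand-alone IPC by sharing.
ws = weighted_speedup([1.0, 0.5], [2.0, 1.0])
```

Under this definition the maximum for an n-application workload is n, which matches the 0 to 64 axis used for the 64-core results.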
151
![Page 152: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/152.jpg)
Methodology: Routers and Network
Input-buffered virtual-channel router
8 VCs, 8 flits/VC [Buffered(8,8)]: large buffered router
4 VCs, 4 flits/VC [Buffered(4,4)]: typical buffered router
4 VCs, 1 flit/VC [Buffered(4,1)]: smallest deadlock-free router
All power-of-2 buffer sizes up to (8, 8) for perf/power sweep
Bufferless deflection router: CHIPPER¹
Bufferless-buffered hybrid router: AFC²
Has input buffers and deflection routing logic
Performs coarse-grained (multi-cycle) mode switching
Common parameters
2-cycle router latency, 1-cycle link latency
2D-mesh topology (16-node: 4x4; 64-node: 8x8)
Dual ejection assumed for baseline routers (for perf. only)
¹ Fallin et al., “CHIPPER: A Low-complexity Bufferless Deflection Router”, HPCA 2011.
² Jafri et al., “Adaptive Flow Control for Robust Performance and Energy”, MICRO 2010.
![Page 153: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/153.jpg)
Methodology: Power, Die Area, Crit. Path
Hardware modeling
Verilog models for CHIPPER, MinBD, buffered control logic
Synthesized with commercial 65nm library
ORION 2.0 for datapath: crossbar, muxes, buffers and links
Power
Static and dynamic power from hardware models
Based on event counts in cycle-accurate simulations
Broken down into buffer, link, other
153
![Page 154: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/154.jpg)
Reduced Deflections & Improved Perf.
154
[Figure: weighted speedup (y-axis, 12 to 15) and deflection rate for each configuration]
Deflection rate: Baseline 28%, B (Side Buffer) 17%, D (Dual-Eject) 22%, S (Silver Flits) 27%, B+D 11%, B+S+D (MinBD) 10%
1. All mechanisms individually reduce deflections
2. Side buffer alone is not sufficient for performance (ejection bottleneck remains)
3. Overall, MinBD improves performance by 5.8% over baseline and 2.7% over dual-eject, reducing deflections by 64% and 54%, respectively
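The 64% / 54% figures can be cross-checked against the deflection rates listed with the chart. This is a minimal sketch, assuming the rates pair with the configurations in label order (baseline 28%, dual-eject 22%, MinBD 10%); the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction when a quantity drops from `before` to `after`."""
    return (before - after) / before


# Deflection rates in percent, as read off the slide (assumed label order).
baseline, dual_eject, minbd = 28.0, 22.0, 10.0

vs_baseline = relative_reduction(baseline, minbd)      # 18 / 28
vs_dual_eject = relative_reduction(dual_eject, minbd)  # 12 / 22

print(f"vs. baseline: {vs_baseline:.1%}, vs. dual-eject: {vs_dual_eject:.1%}")
```

The results land within a percentage point of the quoted 64% and 54%; the small difference is presumably because the slide's rates are rounded.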
![Page 155: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/155.jpg)
Overall Performance Results
155
[Figure: weighted speedup (y-axis, 8 to 16) vs. injection rate for Buffered (8,8), Buffered (4,4), Buffered (4,1), CHIPPER, AFC (4,4), and MinBD-4]
• Improves 2.7% over CHIPPER (8.1% at high load)
• Similar perf. to Buffered (4,1) @ 25% of buffering space
• Within 2.7% of Buffered (4,4) (8.3% at high load)
![Page 156: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/156.jpg)
Overall Power Results
156
[Figure: network power (W, 0.0 to 3.0) for Buffered (8,8), Buffered (4,4), Buffered (4,1), CHIPPER, AFC (4,4), and MinBD-4, shown three ways: dynamic vs. static, buffer vs. non-buffer, and dynamic/static split into buffer, link, and other]
• Buffers are a significant fraction of power in baseline routers
• Buffer power is much smaller in MinBD (4-flit buffer)
• Dynamic power increases with deflection routing
• Dynamic power is lower in MinBD than in CHIPPER
![Page 157: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/157.jpg)
Performance-Power Spectrum
157
[Figure: weighted speedup (y-axis, 13.0 to 15.0) vs. network power (x-axis, 0.5 to 3.0 W) for Buf (1,1), Buf (4,1), Buf (4,4), Buf (8,8), AFC, CHIPPER, and MinBD, annotated with a "More Perf/Power" to "Less Perf/Power" direction]
• MinBD: most energy-efficient (perf/watt) of any evaluated network router design
![Page 158: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/158.jpg)
Die Area and Critical Path
158
[Figure: normalized die area (0 to 2.5) and normalized critical path (0.0 to 1.2) for Buffered (8,8), Buffered (4,4), Buffered (4,1), CHIPPER, and MinBD]
• Area: only 3% increase over CHIPPER (4-flit buffer)
• Area: 36% reduction from Buffered (4,4)
• Critical path: increases by 7% over CHIPPER, 8% over Buffered (4,4)
![Page 159: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/159.jpg)
Conclusions
Bufferless deflection routing offers reduced power & area
But, high deflection rate hurts performance at high load
MinBD (Minimally-Buffered Deflection Router) introduces:
Side buffer to hold only flits that would have been deflected
Dual-width ejection to address ejection bottleneck
Two-level prioritization to avoid unnecessary deflections
MinBD yields reduced power (31%) & reduced area (36%) relative to buffered routers
MinBD yields improved performance (8.1% at high load) relative to bufferless routers, closing half of the perf. gap
MinBD has the best energy efficiency of all evaluated designs with competitive performance
159
![Page 160: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/160.jpg)
More Readings
Studies of congestion and congestion control in on-chip vs. internet-like networks
George Nychis, Chris Fallin, Thomas Moscibroda, Onur Mutlu, and Srinivasan Seshan, "On-Chip Networks from a Networking Perspective: Congestion and Scalability in Many-core Interconnects" Proceedings of the 2012 ACM SIGCOMM Conference (SIGCOMM), Helsinki, Finland, August 2012. Slides (pptx)
George Nychis, Chris Fallin, Thomas Moscibroda, and Onur Mutlu, "Next Generation On-Chip Networks: What Kind of Congestion Control Do We Need?" Proceedings of the 9th ACM Workshop on Hot Topics in Networks (HOTNETS), Monterey, CA, October 2010. Slides (ppt) (key)
160
![Page 161: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/161.jpg)
HAT: Heterogeneous Adaptive Throttling for On-Chip Networks
Kevin Chang, Rachata Ausavarungnirun, Chris Fallin, and Onur Mutlu, "HAT: Heterogeneous Adaptive Throttling for On-Chip Networks"
Proceedings of the 24th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), New York, NY, October 2012. Slides
(pptx) (pdf)
![Page 162: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/162.jpg)
Executive Summary
• Problem: Packets contend in on-chip networks (NoCs), causing congestion, thus reducing performance
• Observations:
1) Some applications are more sensitive to network latency than others
2) Applications must be throttled differently to achieve peak performance
• Key Idea: Heterogeneous Adaptive Throttling (HAT)
1) Application-aware source throttling
2) Network-load-aware throttling rate adjustment
• Result: Improves performance and energy efficiency over state-of-the-art source throttling policies
162
![Page 163: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/163.jpg)
Outline
• Background and Motivation
• Mechanism
• Prior Works
• Results
163
![Page 164: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/164.jpg)
On-Chip Networks
164
[Figure: 3x3 2D mesh of routers (R), each attached to a processing element (PE: cores, L2 banks, memory controllers, etc.)]
• Connect cores, caches, memory controllers, etc
• Packet switched
• 2D mesh: Most commonly used topology
• Primarily serve cache misses and memory requests
• Router designs
– Buffered: Input buffers to hold contending packets
– Bufferless: Misroute (deflect) contending packets
![Page 165: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/165.jpg)
Network Congestion Reduces Performance
165
Network congestion reduces network throughput and application performance
[Figure: 3x3 2D mesh of routers (R), each attached to a processing element (PE: cores, L2 banks, memory controllers, etc.), with packets (P) contending in the network]
Limited shared resources (buffers and links)
• Design constraints: power, chip area, and timing
![Page 166: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/166.jpg)
Goal
• Improve performance in a highly congested NoC
• Reducing network load decreases network congestion, hence improves performance
• Approach: source throttling to reduce network load
– Temporarily delay new traffic injection
• Naïve mechanism: throttle every single node
166
![Page 167: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/167.jpg)
Key Observation #1
167
[Figure: normalized performance (0.0 to 1.2) of mcf, gromacs, and the overall system when throttling gromacs vs. throttling mcf; chart annotations: +9%, -2%]
mcf: network-intensive
gromacs: network-non-intensive
Different applications respond differently to changes in network latency
Throttling mcf reduces congestion
gromacs is more sensitive to network latency
Throttling network-intensive applications benefits system performance more
![Page 168: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/168.jpg)
Key Observation #2
168
[Figure: performance (weighted speedup, 6 to 16) vs. throttling rate (80% to 100%) for Workload 1, Workload 2, and Workload 3; peaks marked at 90%, 92%, and 94%]
Different workloads achieve peak performance at different throttling rates
Dynamically adjusting throttling rate yields better performance than a single static rate
![Page 169: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/169.jpg)
Outline
• Background and Motivation
• Mechanism
• Prior Works
• Results
169
![Page 170: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/170.jpg)
Heterogeneous Adaptive Throttling (HAT)
1. Application-aware throttling: Throttle network-intensive applications that interfere with network-non-intensive applications
2. Network-load-aware throttling rate adjustment: Dynamically adjusts throttling rate to adapt to different workloads
170
![Page 172: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/172.jpg)
Application-Aware Throttling
1. Measure Network Intensity
Use L1 MPKI (misses per thousand instructions) to estimate network intensity
2. Classify Application
Sort applications by L1 MPKI
3. Throttle network-intensive applications
172
[Figure: applications sorted by increasing L1 MPKI; the low-MPKI applications whose summed MPKI stays below NonIntensiveCap (Σ MPKI < NonIntensiveCap) are network-non-intensive, the rest are network-intensive]
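The classification step can be sketched as follows. This is a minimal illustration, assuming the rule suggested by the figure: walk the applications from lowest to highest L1 MPKI and keep marking them network-non-intensive while the running MPKI sum stays below NonIntensiveCap. The function name, the application names, and the MPKI values are hypothetical.

```python
from typing import Dict, Set


def classify_applications(l1_mpki: Dict[str, float],
                          non_intensive_cap: float) -> Set[str]:
    """Return the applications treated as network-intensive (to be throttled)."""
    non_intensive: Set[str] = set()
    running_sum = 0.0
    # Visit applications from lowest to highest L1 MPKI.
    for app, mpki in sorted(l1_mpki.items(), key=lambda item: item[1]):
        if running_sum + mpki >= non_intensive_cap:
            break  # this application and all heavier ones are intensive
        running_sum += mpki
        non_intensive.add(app)
    return set(l1_mpki) - non_intensive


# Hypothetical per-application L1 MPKI measurements for one epoch.
measured = {"app_a": 1.0, "app_b": 2.0, "app_c": 40.0, "app_d": 90.0}
print(sorted(classify_applications(measured, non_intensive_cap=10.0)))
```

Summing MPKI rather than thresholding each application separately caps the total load that the unthrottled group can place on the network.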
![Page 173: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/173.jpg)
Heterogeneous Adaptive Throttling (HAT)
1. Application-aware throttling: Throttle network-intensive applications that interfere with network-non-intensive applications
2. Network-load-aware throttling rate adjustment: Dynamically adjusts throttling rate to adapt to different workloads
173
![Page 174: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/174.jpg)
Dynamic Throttling Rate Adjustment
• For a given network design, peak performance tends to occur at a fixed network load point
• Dynamically adjust throttling rate to achieve that network load point
174
![Page 175: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/175.jpg)
Dynamic Throttling Rate Adjustment
• Goal: maintain network load at a peak performance point
1. Measure network load
2. Compare and adjust throttling rate
If network load > peak point:
Increase throttling rate
elif network load ≤ peak point:
Decrease throttling rate
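The compare-and-adjust step maps onto a small update function. This is a sketch only: the step size and the clamping bounds are assumptions chosen for illustration, since the slide does not state how far the rate moves per adjustment.

```python
def adjust_throttling_rate(rate: float, network_load: float, peak_load: float,
                           step: float = 0.01, min_rate: float = 0.0,
                           max_rate: float = 1.0) -> float:
    """Move the throttling rate one step toward the peak-performance load point."""
    if network_load > peak_load:
        rate += step  # network is overloaded: throttle harder
    else:
        rate -= step  # network has headroom: throttle less
    # Keep the rate inside the (assumed) legal range.
    return min(max_rate, max(min_rate, rate))


# Example: load above the peak point raises the rate, load below lowers it.
print(adjust_throttling_rate(0.90, network_load=0.8, peak_load=0.7))
print(adjust_throttling_rate(0.90, network_load=0.6, peak_load=0.7))
```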
175
![Page 176: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/176.jpg)
Epoch-Based Operation
• Continuous HAT operation is expensive
• Solution: perform HAT at epoch granularity
176
[Figure: timeline split into the current epoch and the next epoch, 100K cycles each]
During epoch:
1) Measure L1 MPKI of each application
2) Measure network load
Beginning of epoch:
1) Classify applications
2) Adjust throttling rate
3) Reset measurements
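The measure-then-reset cycle can be sketched with a small counter class. This is an illustration under assumptions: the class and method names are hypothetical, network-load measurement is left out for brevity, and cycles that overshoot an epoch boundary are simply dropped.

```python
from typing import Dict, Optional


class EpochMonitor:
    """Accumulates per-application counters and reports L1 MPKI once per epoch."""

    def __init__(self, epoch_cycles: int = 100_000) -> None:
        self.epoch_cycles = epoch_cycles
        self.cycles_in_epoch = 0
        self.instructions: Dict[str, int] = {}
        self.l1_misses: Dict[str, int] = {}

    def record(self, app: str, instructions: int, l1_misses: int) -> None:
        """During the epoch: accumulate retired instructions and L1 misses."""
        self.instructions[app] = self.instructions.get(app, 0) + instructions
        self.l1_misses[app] = self.l1_misses.get(app, 0) + l1_misses

    def advance(self, cycles: int) -> Optional[Dict[str, float]]:
        """Advance time; at an epoch boundary return L1 MPKI and reset counters."""
        self.cycles_in_epoch += cycles
        if self.cycles_in_epoch < self.epoch_cycles:
            return None
        mpki = {app: 1000.0 * self.l1_misses[app] / count
                for app, count in self.instructions.items() if count > 0}
        # Beginning of the next epoch: reset measurements.
        self.cycles_in_epoch = 0
        self.instructions.clear()
        self.l1_misses.clear()
        return mpki


monitor = EpochMonitor(epoch_cycles=100)
monitor.record("app_a", instructions=2000, l1_misses=10)
first = monitor.advance(50)    # mid-epoch: nothing to report yet
second = monitor.advance(50)   # epoch boundary: MPKI report
print(first, second)
```

The returned MPKI map is what the classification step would consume at the start of the next epoch.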
![Page 177: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/177.jpg)
Outline
• Background and Motivation
• Mechanism
• Prior Works
• Results
177
![Page 178: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/178.jpg)
Prior Source Throttling Works
• Source throttling for bufferless NoCs [Nychis+ Hotnets’10, SIGCOMM’12]
– Application-aware throttling based on starvation rate
– Does not adaptively adjust throttling rate
– “Heterogeneous Throttling”
• Source throttling for off-chip buffered networks [Thottethodi+ HPCA’01]
– Dynamically trigger throttling based on fraction of buffer occupancy
– Not application-aware: fully block packet injections of every node
– “Self-tuned Throttling”
178
![Page 179: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/179.jpg)
Outline
• Background and Motivation
• Mechanism
• Prior Works
• Results
179
![Page 180: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/180.jpg)
Methodology
• Chip Multiprocessor Simulator
– 64-node multi-core systems with a 2D-mesh topology
– Closed-loop core/cache/NoC cycle-level model
– 64KB L1, perfect L2 (always hits to stress NoC)
• Router Designs
– Virtual-channel buffered router: 4 VCs, 4 flits/VC [Dally+ IEEE TPDS’92]
– Bufferless deflection routers: BLESS [Moscibroda+ ISCA’09]
• Workloads
– 60 multi-core workloads: SPEC CPU2006 benchmarks
– Categorized based on their network intensity
• Low/Medium/High intensity categories
• Metrics: Weighted Speedup (perf.), perf./Watt (energy eff.), and maximum slowdown (fairness)
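The slides name the performance and fairness metrics without defining them. A minimal sketch, assuming the usual definitions: weighted speedup is the sum over applications of IPC when sharing the system divided by IPC when running alone, and maximum slowdown is the largest alone-to-shared IPC ratio. The IPC values below are made up for illustration.

```python
from typing import Sequence


def weighted_speedup(ipc_shared: Sequence[float], ipc_alone: Sequence[float]) -> float:
    """Sum of per-application speedups relative to running alone."""
    return sum(shared / alone for shared, alone in zip(ipc_shared, ipc_alone))


def maximum_slowdown(ipc_shared: Sequence[float], ipc_alone: Sequence[float]) -> float:
    """Slowdown of the most-penalized application (lower is fairer)."""
    return max(alone / shared for shared, alone in zip(ipc_shared, ipc_alone))


# Hypothetical two-application workload.
alone = [2.0, 1.0]
shared = [1.0, 0.8]
print(weighted_speedup(shared, alone), maximum_slowdown(shared, alone))
```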
180
![Page 181: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/181.jpg)
Performance: Bufferless NoC (BLESS)
181
[Figure: weighted speedup (0 to 50) by workload category (HL, HML, HM, H, amean) for BLESS, Hetero., and HAT; chart annotation: 7.4%]
HAT provides better performance improvement than past work
Highest improvement on heterogeneous workload mixes
- L and M are more sensitive to network latency
![Page 182: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/182.jpg)
Performance: Buffered NoC
182
[Figure: weighted speedup (0 to 50) by workload category (HL, HML, HM, H, amean) for Buffered, Self-Tuned, Hetero., and HAT; chart annotation: +3.5%]
Congestion is much lower in Buffered NoC, but HAT still provides performance benefit
![Page 183: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/183.jpg)
Application Fairness
183
HAT provides better fairness than prior works
[Figure: normalized maximum slowdown (0.0 to 1.2, amean) in two panels: BLESS, Hetero., and HAT; and Buffered, Self-Tuned, Hetero., and HAT; chart annotations: -15%, -5%]
![Page 184: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/184.jpg)
Network Energy Efficiency
184
[Figure: normalized perf. per Watt (0.0 to 1.2) for BLESS and Buffered, Baseline vs. HAT; chart annotations: 8.5%, 5%]
HAT increases energy efficiency by reducing congestion
![Page 185: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/185.jpg)
Other Results in Paper
• Performance on CHIPPER
• Performance on multithreaded workloads
• Parameters sensitivity sweep of HAT
185
![Page 186: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/186.jpg)
Conclusion
• Problem: Packets contend in on-chip networks (NoCs), causing congestion, thus reducing performance
• Observations:
1) Some applications are more sensitive to network latency than others
2) Applications must be throttled differently to achieve peak performance
• Key Idea: Heterogeneous Adaptive Throttling (HAT)
1) Application-aware source throttling
2) Network-load-aware throttling rate adjustment
• Result: Improves performance and energy efficiency over state-of-the-art source throttling policies
186
![Page 187: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/187.jpg)
Application-Aware Packet Scheduling
Reetuparna Das, Onur Mutlu, Thomas Moscibroda, and Chita R. Das,
"Application-Aware Prioritization Mechanisms for On-Chip Networks"
Proceedings of the 42nd International Symposium on Microarchitecture
(MICRO), pages 280-291, New York, NY, December 2009. Slides (pptx)
![Page 188: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/188.jpg)
© Onur Mutlu, 2009, 2010
The Problem: Packet Scheduling
188
[Figure: Network-on-Chip connecting processors (P) running App1 through App N+1, L2 cache banks, a memory controller, and an accelerator over the on-chip network]
On-chip Network is a critical resource shared by multiple applications
![Page 189: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/189.jpg)
The Problem: Packet Scheduling
189
[Figure: 4x4 mesh of routers (R) with processing elements (PE: cores, L2 banks, memory controllers, etc.), next to a router microarchitecture: input ports with buffers (From East/West/North/South/PE; VC 0 to VC 2, selected by VC identifier), control logic (Routing Unit (RC), VC Allocator (VA), Switch Allocator (SA)), and a 5x5 crossbar (To East/West/North/South/PE)]
![Page 190: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/190.jpg)
The Problem: Packet Scheduling
190
[Figure: router input ports (From East/West/North/South/PE) with VC 0 to VC 2, feeding the Routing Unit (RC), VC Allocator (VA), and Switch Allocator (SA)]
![Page 191: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/191.jpg)
The Problem: Packet Scheduling
191
[Figure: router input ports (From East/West/North/South/PE, VC 0 to VC 2) with the Routing Unit (RC), VC Allocator (VA), and Switch Allocator (SA), next to a conceptual view of the same buffers holding packets from App1 through App8]
![Page 192: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/192.jpg)
The Problem: Packet Scheduling
192
[Figure: conceptual view of the router as a scheduler choosing among buffered packets from App1 through App8 (From East/West/North/South/PE, VC 0 to VC 2)]
Which packet to choose?
![Page 193: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/193.jpg)
The Problem: Packet Scheduling
Existing scheduling policies
Round Robin
Age
Problem 1: Local to a router
Leads to contradictory decisions between routers: packets from one application may be prioritized at one router, only to be delayed at the next.
Problem 2: Application oblivious
Treats all applications’ packets equally
But applications are heterogeneous
Solution: Application-aware global scheduling policies.
193
![Page 194: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/194.jpg)
Motivation: Stall-Time Criticality
Applications are not homogeneous
Applications have different criticality with respect to the network
Some applications are network latency sensitive
Some applications are network latency tolerant
Application’s Stall Time Criticality (STC) can be measured by its average network stall time per packet (i.e. NST/packet)
Network Stall Time (NST) is number of cycles the processor stalls waiting for network transactions to complete
194
![Page 195: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/195.jpg)
Motivation: Stall-Time Criticality
Why do applications have different network stall time criticality (STC)?
Memory Level Parallelism (MLP)
Lower MLP leads to higher criticality
Shortest Job First Principle (SJF)
Lower network load leads to higher criticality
195
![Page 196: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/196.jpg)
STC Principle 1: MLP
Observation 1: Packet Latency != Network Stall Time
196
[Figure: timeline of an application with high MLP: compute phases and stalls, with three overlapping packet latencies; STALL of Red Packet = 0]
![Page 197: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/197.jpg)
STC Principle 1: MLP
Observation 1: Packet Latency != Network Stall Time
Observation 2: A low MLP application’s packets have higher criticality than a high MLP application’s
197
[Figure: timelines of two applications. High MLP: three overlapping packet latencies, STALL of Red Packet = 0. Low MLP: three packet latencies, each paired with its own stall]
![Page 198: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/198.jpg)
STC Principle 2: Shortest-Job-First
198
[Figure: packet timelines of a light application and a heavy application running alone, under baseline round-robin (RR) scheduling, and under SJF scheduling; annotated network slowdowns: 4X, 1.2X, 1.3X, 1.6X]
Overall system throughput (weighted speedup) increases by 34%
![Page 199: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/199.jpg)
Solution: Application-Aware Policies
Idea
Identify critical applications (i.e. network sensitive applications) and prioritize their packets in each router.
Key components of scheduling policy:
Application Ranking
Packet Batching
Propose low-hardware complexity solution
199
![Page 200: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/200.jpg)
Component 1: Ranking
Ranking distinguishes applications based on Stall Time Criticality (STC)
Periodically rank applications based on STC
Explored many heuristics for estimating STC
Heuristic based on outermost private cache Misses Per Instruction (L1-MPI) is the most effective
Low L1-MPI => high STC => higher rank
Why Misses Per Instruction (L1-MPI)?
Easy to Compute (low complexity)
Stable Metric (unaffected by interference in network)
200
![Page 201: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/201.jpg)
Component 1 : How to Rank?
Execution time is divided into fixed “ranking intervals”
Ranking interval is 350,000 cycles
At the end of an interval, each core calculates its L1-MPI and sends it to the Central Decision Logic (CDL)
CDL is located in the central node of mesh
CDL forms a rank order and sends each core its rank
Two control packets per core every ranking interval
Ranking order is a “partial order”
Rank formation is not on the critical path
Ranking interval is significantly longer than rank computation time
Cores use older rank values until new ranking is available
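The rank formation done by the CDL can be sketched as below. This is an illustration under assumptions: the slides say only that the order is partial and that low L1-MPI means higher rank, so the equal-sized grouping into 8 levels (matching the 3-bit Rank ID mentioned with the scheduling policy) is a guess at one reasonable scheme, and `form_ranks` is a hypothetical name.

```python
from typing import Dict


def form_ranks(l1_mpi: Dict[int, float], num_ranks: int = 8) -> Dict[int, int]:
    """Map core id -> rank id, where rank 0 is the highest priority."""
    # Lowest L1-MPI first: low MPI => high stall-time criticality => higher rank.
    ordered = sorted(l1_mpi, key=lambda core: l1_mpi[core])
    # Ceiling division so that at most `num_ranks` groups are produced.
    group_size = -(-len(ordered) // num_ranks)
    # Cores in the same group share a rank, which makes the order partial.
    return {core: position // group_size for position, core in enumerate(ordered)}


# Hypothetical L1-MPI values reported by 16 cores at the end of an interval.
reported = {core: 0.001 * (core + 1) for core in range(16)}
ranks = form_ranks(reported)
print(ranks[0], ranks[1], ranks[15])
```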
201
![Page 202: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/202.jpg)
Component 2: Batching
Problem: Starvation
Prioritizing a higher-ranked application can lead to starvation of lower-ranked applications
Solution: Packet Batching
Network packets are grouped into finite-sized batches
Packets of older batches are prioritized over younger batches
Time-Based Batching
New batches are formed in a periodic, synchronous manner across all nodes in the network, every T cycles
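A minimal sketch of time-based batching, assuming a batch ID that wraps around after 8 batches (a 3-bit field) and a hypothetical batching period `T`; the slides do not give a value for T, and the helper names are illustrative.

```python
BATCH_BITS = 3                 # assumed width of the batch ID field
NUM_BATCHES = 1 << BATCH_BITS  # batch IDs wrap around after 8 batches


def batch_id(cycle: int, T: int) -> int:
    """Batch ID assigned to a packet injected at the given cycle.

    Every node derives the ID from the same global cycle count, so a new
    batch starts synchronously across the network every T cycles.
    """
    return (cycle // T) % NUM_BATCHES


def batch_age(packet_batch: int, current_batch: int) -> int:
    """How many batch periods old a packet is, accounting for wraparound."""
    return (current_batch - packet_batch) % NUM_BATCHES
```

A router would prefer the packet with the largest `batch_age`, which is what bounds how long a low-ranked packet can wait.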
![Page 203: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/203.jpg)
Putting it all together: STC Scheduling Policy
Before injecting a packet into the network, it is tagged with
Batch ID (3 bits)
Rank ID (3 bits)
Three tier priority structure at routers
Oldest batch first (prevent starvation)
Highest rank first (maximize performance)
Local Round-Robin (final tie breaker)
Simple hardware support: priority arbiters
Global coordinated scheduling
Ranking order and batching order are the same across all routers
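The three-tier priority can be written as a single comparison key. The sketch below is illustrative only: the `Packet` fields, the convention that a smaller rank number means a higher rank, and the 5-port round-robin pointer are assumptions, not details taken from the slides.

```python
from dataclasses import dataclass

NUM_BATCHES = 8  # 3-bit batch ID


@dataclass
class Packet:
    name: str
    batch: int  # batch ID tagged at injection (3 bits)
    rank: int   # rank ID tagged at injection (3 bits); 0 = highest rank (assumed)
    port: int   # router input port the packet is waiting at


def stc_pick(candidates, current_batch, rr_pointer, num_ports=5):
    """Pick the packet to serve: oldest batch, then highest rank, then round-robin."""
    def key(p):
        age = (current_batch - p.batch) % NUM_BATCHES    # tier 1: older batch wins
        rr_distance = (p.port - rr_pointer) % num_ports  # tier 3: local tie breaker
        return (-age, p.rank, rr_distance)               # tier 2 is p.rank
    return min(candidates, key=key)
```

Because batch and rank come from the packet header and are assigned globally, every router orders the same two packets the same way; only the final round-robin tie breaker is local.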
![Page 204: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/204.jpg)
STC Scheduling Example
[Figure: packets from several applications, labeled by injection cycle (1 to 8), wait at a router scheduler and are grouped into Batch 0, Batch 1, and Batch 2.]
![Page 205: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/205.jpg)
STC Scheduling Example
[Figure: order in which the router scheduler serves the packets over time under Round Robin.]
| Policy | Stall cycles per application | Avg |
| RR | 8, 6, 11 | 8.3 |
| Age | | |
| STC | | |
![Page 206: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/206.jpg)
STC Scheduling Example
[Figure: order in which the router scheduler serves the packets over time under Round Robin and Age.]
| Policy | Stall cycles per application | Avg |
| RR | 8, 6, 11 | 8.3 |
| Age | 4, 6, 11 | 7.0 |
| STC | | |
![Page 207: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/207.jpg)
STC Scheduling Example
[Figure: order in which the router scheduler serves the packets over time under Round Robin, Age, and STC; STC follows the rank order within a batch.]
| Policy | Stall cycles per application | Avg |
| RR | 8, 6, 11 | 8.3 |
| Age | 4, 6, 11 | 7.0 |
| STC | 1, 3, 11 | 5.0 |
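As a quick arithmetic check, the averages in the stall-cycle table follow directly from the per-application values. The dictionary layout below is just a convenient way to hold the numbers from the slide.

```python
# Per-application stall cycles from the example, one list per policy.
stall_cycles = {
    "RR": [8, 6, 11],
    "Age": [4, 6, 11],
    "STC": [1, 3, 11],
}


def average(values):
    """Arithmetic mean of the per-application stall cycles."""
    return sum(values) / len(values)


# Rounded to one decimal place, as on the slide.
averages = {policy: round(average(v), 1) for policy, v in stall_cycles.items()}
```

The third application stalls for 11 cycles under every policy; the gain from STC comes from the two applications whose packets are served earlier.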
![Page 208: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/208.jpg)
STC Evaluation Methodology
64-core system
x86 processor model based on Intel Pentium M
2 GHz processor, 128-entry instruction window
32KB private L1 and 1MB per core shared L2 caches, 32 miss buffers
4GB DRAM, 320 cycle access latency, 4 on-chip DRAM controllers
Detailed Network-on-Chip model
2-stage routers (with speculation and look ahead routing)
Wormhole switching (8 flit data packets)
Virtual channel flow control (6 VCs, 5 flit buffer depth)
8x8 Mesh (128 bit bi-directional channels)
Benchmarks
Multiprogrammed scientific, server, desktop workloads (35 applications)
96 workload combinations
![Page 209: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/209.jpg)
Comparison to Previous Policies
Round Robin & Age (Oldest-First)
Local and application oblivious
Age is biased towards heavy applications
heavy applications flood the network
higher likelihood of an older packet being from heavy application
Globally Synchronized Frames (GSF) [Lee et al., ISCA 2008]
Provides bandwidth fairness at the expense of system performance
Penalizes heavy and bursty applications
Each application gets equal and fixed quota of flits (credits) in each batch.
Heavy applications quickly run out of credits after injecting into all active batches, and stall until the oldest batch completes and frees up fresh credits.
Underutilization of network resources
![Page 210: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/210.jpg)
STC System Performance and Fairness
9.1% improvement in weighted speedup over the best existing policy (averaged across 96 workloads)
[Figure: two bar charts comparing LocalRR, LocalAge, GSF, and STC. Left: normalized system speedup. Right: network unfairness.]
![Page 211: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/211.jpg)
Enforcing Operating System Priorities
Existing policies cannot enforce operating system (OS) assigned priorities in Network-on-Chip
Proposed framework can enforce OS assigned priorities
Weight of applications => Ranking of applications
Configurable batching interval based on application weight
[Figure: network slowdown per application. Left: xalan-1 through xalan-8 under LocalRR, LocalAge, GSF, and STC. Right: xalan (weight 1), leslie (weight 2), lbm (weight 2), and tpcw (weight 8) under LocalRR, LocalAge, GSF-1-2-2-8, and STC-1-2-2-8.]
Weighted speedup:
| Workload | LocalRR | LocalAge | GSF | STC |
| xalan-1 through xalan-8 | 0.49 | 0.49 | 0.46 | 0.52 |
| Weights 1-2-2-8 | 0.46 | 0.44 | 0.27 | 0.43 |
![Page 212: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/212.jpg)
Application Aware Packet Scheduling: Summary
Packet scheduling policies critically impact performance and fairness of NoCs
Existing packet scheduling policies are local and application oblivious
STC is a new, global, application-aware approach to packet scheduling in NoCs
Ranking: differentiates applications based on their criticality
Batching: avoids starvation due to rank-based prioritization
Proposed framework
provides higher system performance and fairness than existing policies
can enforce OS assigned priorities in network-on-chip
![Page 213: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/213.jpg)
Slack-Driven Packet Scheduling
Reetuparna Das, Onur Mutlu, Thomas Moscibroda, and Chita R. Das,
"Aergia: Exploiting Packet Latency Slack in On-Chip Networks"
Proceedings of the 37th International Symposium on Computer Architecture
(ISCA), pages 106-116, Saint-Malo, France, June 2010.
![Page 214: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/214.jpg)
Packet Scheduling in NoC
Existing scheduling policies
Round robin
Age
Problem
Treat all packets equally
Application-oblivious
Packets have different criticality
Packet is critical if latency of a packet affects application’s performance
Different criticality due to memory level parallelism (MLP)
All packets are not the same…!!!
![Page 215: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/215.jpg)
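MLP Principle
[Figure: execution timeline with compute and stall phases and the latencies of three overlapping packets.]
Stall = 0 for a packet whose latency is fully overlapped with another outstanding packet
Packet Latency != Network Stall Time
Different packets have different criticality due to MLP
The three packets in the figure have strictly decreasing criticality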
![Page 216: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/216.jpg)
Outline
Introduction
Packet Scheduling
Memory Level Parallelism
Aergia
Concept of Slack
Estimating Slack
Evaluation
Conclusion
![Page 217: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/217.jpg)
What is Aergia?
Aergia is the spirit of laziness in Greek mythology
Some packets can afford to slack!
![Page 218: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/218.jpg)
![Page 219: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/219.jpg)
Slack of Packets
What is slack of a packet?
Slack of a packet is the number of cycles it can be delayed in a router without (significantly) reducing the application’s performance
Local network slack
Source of slack: Memory-Level Parallelism (MLP)
Latency of an application’s packet is hidden from the application due to overlap with the latency of pending cache miss requests
Prioritize packets with lower slack
![Page 220: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/220.jpg)
Concept of Slack
[Figure: an instruction window issues two load misses, each causing a packet in the Network-on-Chip; the execution timeline shows compute and stall phases. The shorter packet returns earlier than necessary.]
Slack(shorter packet) = Latency(longer packet) – Latency(shorter packet) = 26 – 6 = 20 hops
The shorter packet can be delayed for the available slack cycles without reducing performance!
![Page 221: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/221.jpg)
Prioritizing using Slack
| Core | Packet latency | Slack |
| Core A | 13 hops | 0 hops |
| Core A | 3 hops | 10 hops |
| Core B | 10 hops | 0 hops |
| Core B | 4 hops | 6 hops |
[Figure: each load miss at Core A and Core B causes a packet; two of the packets interfere at 3 hops.]
Prioritize the interfering packet with the lower slack
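The slack values in this example can be reproduced with a small sketch. It assumes slack is the longest latency among the core's other outstanding packets minus the packet's own latency, clamped at zero, and that the two contending packets are the short ones from each core; the function and packet names are illustrative.

```python
def slack(own_latency, other_latencies):
    """Slack in hops: how much longer the slowest other outstanding packet takes.

    Clamping at 0 is an assumption: a packet that is itself the slowest
    has no slack.
    """
    longest_other = max(other_latencies, default=0)
    return max(0, longest_other - own_latency)


def pick_lower_slack(contenders):
    """contenders: list of (name, slack) pairs; returns the name to serve first."""
    return min(contenders, key=lambda c: c[1])[0]


# Core A has packets of 13 and 3 hops outstanding; Core B has 10 and 4 hops.
a_short = slack(3, [13])
b_short = slack(4, [10])
winner = pick_lower_slack([("A-short", a_short), ("B-short", b_short)])
```

Here Core A's short packet can tolerate more delay than Core B's, so the router serves Core B's packet first.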
![Page 222: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/222.jpg)
Slack in Applications
[Figure: distribution of packet slack for Gems. X-axis: slack in cycles (0 to 500). Y-axis: percentage of all packets (%).]
50% of packets have 350+ slack cycles (non-critical)
10% of packets have <50 slack cycles (critical)
![Page 223: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/223.jpg)
Slack in Applications
[Figure: distribution of packet slack for Gems and art. X-axis: slack in cycles (0 to 500). Y-axis: percentage of all packets (%).]
68% of packets have zero slack cycles
![Page 224: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/224.jpg)
Diversity in Slack
[Figure: distribution of packet slack for Gems, omnet, tpcw, mcf, bzip2, sjbb, sap, sphinx, deal, barnes, astar, calculix, art, libquantum, sjeng, and h264ref. X-axis: slack in cycles (0 to 500). Y-axis: percentage of all packets (%).]
![Page 225: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/225.jpg)
Diversity in Slack
[Figure: distribution of packet slack for the same 16 applications, Gems through h264ref. X-axis: slack in cycles (0 to 500). Y-axis: percentage of all packets (%).]
Slack varies between packets of different applications
Slack varies between packets of a single application
![Page 226: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/226.jpg)
![Page 227: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/227.jpg)
Estimating Slack Priority
Slack (P) = Max (Latencies of P’s Predecessors) – Latency of P
Predecessors(P) are the packets of outstanding cache miss requests when P is issued
Packet latencies not known when issued
Predicting latency of any packet Q
Higher latency if Q corresponds to an L2 miss
Higher latency if Q has to travel a larger number of hops
![Page 228: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/228.jpg)
Estimating Slack Priority
Slack of P = Maximum Predecessor Latency – Latency of P
Slack(P) = PredL2 (2 bits) | MyL2 (1 bit) | HopEstimate (2 bits)
PredL2: Set if any predecessor packet is servicing L2 miss
MyL2: Set if P is NOT servicing an L2 miss
HopEstimate: Max (# of hops of Predecessors) – hops of P
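One way to read the three fields is as a 5-bit slack value in which larger numbers mean more slack. The sketch below makes several assumptions not stated on the slide: PredL2 is treated as a count of predecessor L2 misses saturating at 3, the hop difference is clamped to the range 0 to 3, and PredL2 occupies the most significant bits.

```python
def slack_bits(pred_l2_misses, my_l2_miss, pred_hops, my_hops):
    """Pack PredL2 (2 bits), MyL2 (1 bit), HopEstimate (2 bits) into 5 bits.

    pred_l2_misses: number of predecessor packets predicted to miss in L2
    my_l2_miss:     True if this packet is predicted to be an L2 miss
    pred_hops:      hop counts of the predecessor packets
    my_hops:        hop count of this packet
    """
    pred_l2 = min(pred_l2_misses, 3)   # assumed 2-bit saturating count
    my_l2 = 0 if my_l2_miss else 1     # set when P is NOT servicing an L2 miss
    hop_diff = max(pred_hops, default=0) - my_hops
    hop_est = min(max(hop_diff, 0), 3)  # assumed clamp to the 2-bit range
    return (pred_l2 << 3) | (my_l2 << 2) | hop_est
```

A packet with no predecessors that is itself an L2 miss gets the value 0, the lowest slack and therefore the highest priority within its batch.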
![Page 229: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/229.jpg)
Estimating Slack Priority
How to predict L2 hit or miss at core?
Global Branch Predictor based L2 Miss Predictor
Use Pattern History Table and 2-bit saturating counters
Threshold based L2 Miss Predictor
If #L2 misses in the last “M” misses >= “T” threshold, then the next load is predicted to be an L2 miss.
Number of miss predecessors?
List of outstanding L2 Misses
Hops estimate?
Hops => ∆X + ∆Y distance
Use predecessor list to calculate slack hop estimate
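The threshold-based predictor and the hop estimate are simple enough to sketch directly. The class below is an illustration under assumptions: it keeps the outcomes of the last M L1 misses in a sliding window, and the values of M and T are hypothetical parameters chosen by the caller.

```python
from collections import deque


class ThresholdL2MissPredictor:
    """Predict an L2 miss if at least T of the last M L1 misses also missed in L2."""

    def __init__(self, m, t):
        self.history = deque(maxlen=m)  # sliding window over the last M misses
        self.t = t

    def record(self, was_l2_miss):
        """Record whether the most recent L1 miss also missed in L2."""
        self.history.append(bool(was_l2_miss))

    def predict_miss(self):
        return sum(self.history) >= self.t


def hops(src, dst):
    """Hop count between two mesh nodes given as (x, y): delta-X plus delta-Y."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])
```

The hop count needs only the source and destination coordinates, so it is available at injection time, unlike the packet's actual latency.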
![Page 230: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/230.jpg)
Starvation Avoidance
Problem: Starvation
Prioritizing packets can lead to starvation of lower priority packets
Solution: Time-Based Packet Batching
New batches are formed every T cycles
Packets of older batches are prioritized over younger batches
![Page 231: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/231.jpg)
Putting it all together
Tag header of the packet with priority bits before injection
Priority(P) = Batch (3 bits) | PredL2 (2 bits) | MyL2 (1 bit) | HopEstimate (2 bits)
Priority(P)?
P’s batch (highest priority)
P’s Slack
Local Round-Robin (final tie breaker)
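The header tag and the resulting arbitration order can be sketched as follows. The bit layout (batch in the top 3 bits, followed by PredL2, MyL2, and HopEstimate) and the helper names are assumptions for illustration; the slide gives the field widths and the priority order but not the exact encoding.

```python
def pack_header(batch, pred_l2, my_l2, hop_est):
    """Pack the 8 priority bits: batch(3) | PredL2(2) | MyL2(1) | HopEstimate(2)."""
    return ((batch & 0x7) << 5) | ((pred_l2 & 0x3) << 3) | ((my_l2 & 0x1) << 2) | (hop_est & 0x3)


def unpack_header(tag):
    """Return (batch, slack) where slack is the 5-bit PredL2|MyL2|HopEstimate value."""
    return tag >> 5, tag & 0x1F


def aergia_key(tag, current_batch, rr_distance):
    """Sort key for arbitration; the smallest key is served first."""
    batch, slack = unpack_header(tag)
    age = (current_batch - batch) % 8  # older batch first, with wraparound
    return (-age, slack, rr_distance)  # then lower slack, then round-robin
```

Comparing keys rather than raw tags keeps the batch comparison correct when the 3-bit batch ID wraps around.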
![Page 232: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/232.jpg)
![Page 233: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/233.jpg)
Evaluation Methodology
64-core system
x86 processor model based on Intel Pentium M
2 GHz processor, 128-entry instruction window
32KB private L1 and 1MB per core shared L2 caches, 32 miss buffers
4GB DRAM, 320 cycle access latency, 4 on-chip DRAM controllers
Detailed Network-on-Chip model
2-stage routers (with speculation and look ahead routing)
Wormhole switching (8 flit data packets)
Virtual channel flow control (6 VCs, 5 flit buffer depth)
8x8 Mesh (128 bit bi-directional channels)
Benchmarks
Multiprogrammed scientific, server, desktop workloads (35 applications)
96 workload combinations
![Page 234: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/234.jpg)
Qualitative Comparison
Round Robin & Age
Local and application oblivious
Age is biased towards heavy applications
Globally Synchronized Frames (GSF) [Lee et al., ISCA 2008]
Provides bandwidth fairness at the expense of system performance
Penalizes heavy and bursty applications
Application-Aware Prioritization Policies (SJF) [Das et al., MICRO 2009]
Shortest-Job-First Principle
Packet scheduling policies that prioritize network-sensitive applications, which inject lower load
![Page 235: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/235.jpg)
System Performance
SJF provides 8.9% improvement in weighted speedup
Aergia improves system throughput by 10.3%
Aergia+SJF improves system throughput by 16.1%
[Figure: normalized system speedup for Age, RR, GSF, SJF, Aergia, and SJF+Aergia.]
![Page 236: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/236.jpg)
Network Unfairness
SJF does not imbalance network fairness
Aergia reduces network unfairness by 1.5X
SJF+Aergia reduces network unfairness by 1.3X
[Figure: network unfairness for Age, RR, GSF, SJF, Aergia, and SJF+Aergia.]
![Page 237: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/237.jpg)
Conclusions & Future Directions
Packets have different criticality, yet existing packet scheduling policies treat all packets equally
We propose a new approach to packet scheduling in NoCs
We define Slack as a key measure that characterizes the relative importance of a packet.
We propose Aergia, a novel architecture to accelerate low-slack critical packets
Result
Improves system performance: 16.1%
Improves network fairness: 30.8%
![Page 238: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/238.jpg)
Express-Cube Topologies
Boris Grot, Joel Hestness, Stephen W. Keckler, and Onur Mutlu,
"Express Cube Topologies for On-Chip Interconnects"
Proceedings of the 15th International Symposium on High-Performance
Computer Architecture (HPCA), pages 163-174, Raleigh, NC, February 2009.
![Page 239: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/239.jpg)
UTCS 239 HPCA '09
2-D Mesh
![Page 240: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/240.jpg)
2-D Mesh
Pros
Low design & layout complexity
Simple, fast routers
Cons
Large diameter
Energy & latency impact
![Page 241: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/241.jpg)
Concentration (Balfour & Dally, ICS ‘06)
Pros
Multiple terminals attached to a router node
Fast nearest-neighbor communication via the crossbar
Hop count reduction proportional to concentration degree
Cons
Benefits limited by crossbar complexity
![Page 242: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/242.jpg)
Concentration
Side-effects
Fewer channels
Greater channel width
![Page 243: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/243.jpg)
Replication
CMesh-X2
Benefits
Restores bisection channel count
Restores channel width
Reduced crossbar complexity
![Page 244: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/244.jpg)
Flattened Butterfly (Kim et al., Micro ‘07)
Objectives:
Improve connectivity
Exploit the wire budget
![Page 245: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/245.jpg)
![Page 246: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/246.jpg)
![Page 247: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/247.jpg)
![Page 248: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/248.jpg)
![Page 249: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/249.jpg)
Flattened Butterfly (Kim et al., Micro ‘07)
Pros
Excellent connectivity
Low diameter: 2 hops
Cons
High channel count: k²/2 per row/column
Low channel utilization
Increased control (arbitration) complexity
![Page 250: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/250.jpg)
Multidrop Express Channels (MECS)
Objectives:
Connectivity
More scalable channel count
Better channel utilization
![Page 251: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/251.jpg)
![Page 252: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/252.jpg)
![Page 253: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/253.jpg)
![Page 254: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/254.jpg)
![Page 255: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/255.jpg)
![Page 256: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/256.jpg)
Multidrop Express Channels (MECS)
Pros
One-to-many topology
Low diameter: 2 hops
k channels per row/column
Asymmetric
Cons
Asymmetric
Increased control (arbitration) complexity
![Page 257: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/257.jpg)
Partitioning: a GEC Example
MECS
MECS-X2
Flattened Butterfly
Partitioned MECS
![Page 258: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/258.jpg)
Analytical Comparison
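| | CMesh | CMesh | FBfly | FBfly | MECS | MECS |
| Network Size | 64 | 256 | 64 | 256 | 64 | 256 |
| Radix (conctr’d) | 4 | 8 | 4 | 8 | 4 | 8 |
| Diameter | 6 | 14 | 2 | 2 | 2 | 2 |
| Channel count | 2 | 2 | 8 | 32 | 4 | 8 |
| Channel width | 576 | 1152 | 144 | 72 | 288 | 288 |
| Router inputs | 4 | 4 | 6 | 14 | 6 | 14 |
| Router outputs | 4 | 4 | 6 | 14 | 4 | 4 |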
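The entries in this comparison follow simple patterns in the concentrated radix k. The sketch below reproduces them under stated assumptions: the channel counts k²/2 (flattened butterfly) and k (MECS) come from the earlier slides, while the mesh diameter 2(k-1), the router port counts 2(k-1), and the fixed wire budget per row/column (1152 for 64 terminals, 2304 for 256, obtained as channel count times channel width) are inferred from the numbers in the table rather than stated in the slides.

```python
def cmesh(k, wire_budget):
    """Concentrated mesh with radix k per dimension."""
    channels = 2
    return {"diameter": 2 * (k - 1), "channels": channels,
            "width": wire_budget // channels, "inputs": 4, "outputs": 4}


def fbfly(k, wire_budget):
    """Flattened butterfly: k^2/2 channels per row/column, 2-hop diameter."""
    channels = k * k // 2
    return {"diameter": 2, "channels": channels,
            "width": wire_budget // channels,
            "inputs": 2 * (k - 1), "outputs": 2 * (k - 1)}


def mecs(k, wire_budget):
    """MECS: k channels per row/column, many inputs but only 4 network outputs."""
    channels = k
    return {"diameter": 2, "channels": channels,
            "width": wire_budget // channels,
            "inputs": 2 * (k - 1), "outputs": 4}


# Wire budget per row/column inferred from the table (count x width).
WIRE_BUDGET = {64: 1152, 256: 2304}
```

Under these assumptions, the flattened butterfly's channel width shrinks as the network grows because its channel count grows quadratically, while MECS keeps the same width at both network sizes.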
![Page 259: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/259.jpg)
Experimental Methodology
| Parameter | Value |
| Topologies | Mesh, CMesh, CMesh-X2, FBFly, MECS, MECS-X2 |
| Network sizes | 64 & 256 terminals |
| Routing | DOR, adaptive |
| Messages | 64 & 576 bits |
| Synthetic traffic | Uniform random, bit complement, transpose, self-similar |
| PARSEC benchmarks | Blackscholes, Bodytrack, Canneal, Ferret, Fluidanimate, Freqmine, Vip, x264 |
| Full-system config | M5 simulator, Alpha ISA, 64 OOO cores |
| Energy evaluation | Orion + CACTI 6 |
![Page 260: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/260.jpg)
[Figure: 64 nodes, uniform random traffic. Latency (cycles, 0 to 40) vs. injection rate (%, 1 to 40) for mesh, cmesh, cmesh-x2, fbfly, mecs, mecs-x2]
![Page 261: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/261.jpg)
[Figure: 256 nodes, uniform random traffic. Latency (cycles, 0 to 70) vs. injection rate (%, 1 to 25) for mesh, cmesh-x2, fbfly, mecs, mecs-x2]
![Page 262: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/262.jpg)
[Figure: Energy (100K packets, uniform random). Average packet energy (nJ, 0.0 to 2.0), split into link energy and router energy, for 64 nodes and 256 nodes]
![Page 263: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/263.jpg)
[Figure: 64 nodes, PARSEC. Total network energy (J), split into router energy and link energy, and average packet latency (cycles) for Blackscholes, Canneal, Vips, x264]
![Page 264: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/264.jpg)
Summary
MECS: a new one-to-many topology
Good fit for planar substrates
Excellent connectivity
Effective wire utilization
Generalized Express Cubes: framework & taxonomy for NOC topologies
Extension of the k-ary n-cube model
Useful for understanding and exploring on-chip interconnect options
Future: expand & formalize
![Page 265: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/265.jpg)
Kilo-NoC: Topology-Aware QoS
Boris Grot, Joel Hestness, Stephen W. Keckler, and Onur Mutlu, "Kilo-NOC: A Heterogeneous Network-on-Chip Architecture for Scalability and Service Guarantees," Proceedings of the 38th International Symposium on Computer Architecture (ISCA), San Jose, CA, June 2011.
![Page 266: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/266.jpg)
Motivation
Extreme-scale chip-level integration
Cores
Cache banks
Accelerators
I/O logic
Network-on-chip (NOC)
10-100 cores today
1000+ assets in the near future
![Page 267: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/267.jpg)
Kilo-NOC requirements
High efficiency
Area
Energy
Good performance
Strong service guarantees (QoS)
![Page 268: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/268.jpg)
Topology-Aware QoS
Problem: QoS support in each router is expensive (in terms of buffering, arbitration, bookkeeping)
E.g., Grot et al., “Preemptive Virtual Clock: A Flexible, Efficient, and Cost-effective QOS Scheme for Networks-on-Chip,” MICRO 2009.
Goal: Provide QoS guarantees at low area and power cost
Idea:
Isolate shared resources in a region of the network, support QoS within that area
Design the topology so that applications can access the region without interference
![Page 269: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/269.jpg)
Baseline QOS-enabled CMP
Multiple VMs sharing a die
Shared resources (e.g., memory controllers)
VM-private resources (cores, caches)
[Figure: grid of routers on a die shared by VM #1, VM #2, and VM #3; Q marks a QOS-enabled router, and every router is marked Q]
![Page 270: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/270.jpg)
Conventional NOC QOS
Contention scenarios:
Shared resources: memory access
Intra-VM traffic: shared cache access
Inter-VM traffic: VM page sharing
[Figure: grid of QOS-enabled routers (Q) on a die shared by VM #1, VM #2, and VM #3]
![Page 271: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/271.jpg)
Conventional NOC QOS
Network-wide guarantees without network-wide QOS support
![Page 272: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/272.jpg)
Kilo-NOC QOS
Insight: leverage rich network connectivity
Naturally reduce interference among flows
Limit the extent of hardware QOS support
Requires a low-diameter topology
This work: Multidrop Express Channels (MECS) (Grot et al., HPCA 2009)
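Why a low-diameter topology reduces interference can be seen by listing the routers a packet is switched through on its way to a shared resource. The sketch below compares one row of a mesh with one row of MECS. It rests on assumptions not spelled out on the slide: the helper names and the four-column `owner` layout are hypothetical, and MECS is modeled as switching a packet only at the source and destination routers of a multidrop channel.

```python
def mesh_row_routers(src_col: int, dst_col: int) -> list[int]:
    """Columns whose routers switch the packet on a mesh row, endpoints included."""
    step = 1 if dst_col >= src_col else -1
    return list(range(src_col, dst_col + step, step))

def mecs_row_routers(src_col: int, dst_col: int) -> list[int]:
    """Assumed MECS model: the packet is switched only at source and destination."""
    return [src_col] if src_col == dst_col else [src_col, dst_col]

def foreign_routers(path: list[int], owner: dict[int, str], vm: str) -> list[int]:
    """Intermediate routers on the path that belong to someone other than vm."""
    return [col for col in path[1:-1] if owner[col] != vm]

# Hypothetical row: VM1 at column 0, VM2 at columns 1-2, shared resources at column 3.
owner = {0: "VM1", 1: "VM2", 2: "VM2", 3: "shared"}

print("mesh:", foreign_routers(mesh_row_routers(0, 3), owner, "VM1"))
print("mecs:", foreign_routers(mecs_row_routers(0, 3), owner, "VM1"))
```

On the mesh, VM1's memory traffic competes for the routers that VM2 owns, so those routers would need QOS logic to keep the guarantees. With a single-hop channel, the only router where flows from different VMs meet is the one in the shared column, which is where the hardware QOS support can be concentrated.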
![Page 273: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/273.jpg)
Topology-Aware QOS
Dedicated, QOS-enabled regions
Rest of die: QOS-free
Richly-connected topology
Traffic isolation
Special routing rules
Manage interference
[Figure: die with QOS-enabled routers (Q) confined to a dedicated region; the rest of the die hosts VM #1, VM #2, and VM #3]
![Page 274: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/274.jpg)
![Page 275: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/275.jpg)
![Page 276: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/276.jpg)
![Page 277: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/277.jpg)
Kilo-NOC view
Topology-aware QOS support
Limit QOS complexity to a fraction of the die
Optimized flow control
Reduce buffer requirements in QOS-free regions
[Figure: die with QOS-enabled routers (Q) confined to a dedicated region; the rest of the die hosts VM #1, VM #2, and VM #3]
![Page 278: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/278.jpg)
| Parameter | Value |
|---|---|
| Technology | 15 nm |
| Vdd | 0.7 V |
| System | 1024 tiles: 256 concentrated nodes (64 shared resources) |

Networks:

| Network | Configuration |
|---|---|
| MECS+PVC | VC flow control, QOS support (PVC) at each node |
| MECS+TAQ | VC flow control, QOS support only in shared regions |
| MECS+TAQ+EB | EB flow control outside of shared regions (SRs), separate request and reply networks |
| K-MECS | Proposed organization: TAQ + hybrid flow control |
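The system parameters above already indicate how much QOS hardware topology-aware QOS (TAQ) avoids. The arithmetic below is a sketch under two assumptions that the slide does not state outright: each concentrated node has exactly one router, and the 64 shared resources correspond to 64 of the 256 nodes, which are the only ones with QOS support in the TAQ configurations. The names `qos_routers` and `CONFIGS` are hypothetical.

```python
TILES = 1024
NODES = 256          # concentrated nodes; one router per node is an assumption
SHARED_NODES = 64    # assumed to be the nodes inside the shared regions

concentration = TILES // NODES  # tiles sharing one router

def qos_routers(qos_everywhere: bool) -> int:
    """Number of routers that carry QOS logic under the stated assumptions."""
    return NODES if qos_everywhere else SHARED_NODES

# True: QOS support at every node; False: QOS support only in shared regions.
CONFIGS = {
    "MECS+PVC": True,
    "MECS+TAQ": False,
    "MECS+TAQ+EB": False,
    "K-MECS": False,
}

print(f"concentration: {concentration} tiles per node")
for name, everywhere in CONFIGS.items():
    count = qos_routers(everywhere)
    print(f"{name}: {count} QOS routers ({count / NODES:.0%} of nodes)")
```

Under these assumptions, the TAQ-based networks keep QOS buffering, arbitration, and bookkeeping in a quarter of the routers, which is the "fraction of the die" the design refers to.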
![Page 279: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/279.jpg)
![Page 280: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/280.jpg)
![Page 281: Computer Architecture: Interconnects (Part I)ece740/f13/lib/exe/fetch...Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems Mutlu et](https://reader033.fdocuments.us/reader033/viewer/2022060300/5f0830887e708231d420c8b9/html5/thumbnails/281.jpg)
Kilo-NOC: a heterogeneous NOC architecture for kilo-node substrates
Topology-aware QOS
Limits QOS support to a fraction of the die
Leverages low-diameter topologies
Improves NOC area- and energy-efficiency
Provides strong guarantees