CCNoC - Technical University of Denmark...– OLTP, DSS, Web • Custom power models • Wide Mesh...
CCNoC: Specializing On-Chip Interconnects for Energy Efficiency in Cache-Coherent Servers
Stavros Volos, Ciprian Seiculescu, Boris Grot, Naser Khosro Pour,
Babak Falsafi, and Giovanni De Micheli
![Page 2: CCNoC - Technical University of Denmark...– OLTP, DSS, Web • Custom power models • Wide Mesh – 176 bits, 3 VCs • Homogeneous – 2x 88 bits, 3 VCs/network • Heterogeneous](https://reader033.fdocuments.us/reader033/viewer/2022060916/60a92ce104a26e2611707ccb/html5/thumbnails/2.jpg)
Toward Manycore Tiled Servers
Server workloads
• Many clients using a common service
• Manycore chips to maximize throughput
Tiled organizations are inherently scalable
• Rely on NoCs for communication
NoCs play a pivotal role
• Affect access latency of instructions & data
• Growing area & power footprints
[Figure: 16-tile chip; each tile pairs a core with a cache ($)]
© 2012 Stavros Volos
Need efficient NoCs for Server Chips!
Multi-Network NoCs: The Way to Specialization & Efficiency
Multi-network NoCs are superior to single-network NoCs [Balfour’06]
• Reduce crossbar area & power
• Improve wire utilization
But a multi-network NoC is not simple for servers:
• Cache coherence complicates NoC resource allocation
• Naïve division of networks across traffic is suboptimal
How do we build multi-network NoCs for Servers?
Our proposal: CCNoC
Bimodal network traffic in server workloads
• Short requests & long responses dominate
CCNoC: dual-network NoC for servers
• Narrow request and wide response networks
• Specialization of router microarchitectures
Compared to a homogeneous dual-network NoC
• 15% less energy
• 31% less area
• No impact on performance
Outline
• Overview
• Why Multi-Network NoCs?
• Multi-Network NoCs for Servers
• CCNoC
• Results
• Conclusion
Why Multi-Network NoCs?
Wider networks reduce packet latency
But crossbar costs can be prohibitive
• Area: quadratic in network width
• Power: linear in network width
• Utilization: poor on short packets
Multiple networks more efficient [Balfour’06]
• Reduce area & power for fixed NoC bandwidth
• Improve wire utilization
[Figure: a 2N-bit-wide network (lower latency) versus multiple N-bit-wide networks (less area)]
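The area argument on this slide can be checked with a few lines of arithmetic. The sketch below assumes only the quadratic area model stated above, with a unit constant; the function names and the base width `N = 64` are illustrative and not taken from the talk.

```python
def crossbar_area(width_bits: int) -> int:
    """Relative crossbar area, assuming area grows with the square of the width."""
    return width_bits ** 2


def relative_area(widths, baseline_width: int) -> float:
    """Total crossbar area of several networks, relative to one wide network."""
    total = sum(crossbar_area(w) for w in widths)
    return total / crossbar_area(baseline_width)


N = 64  # hypothetical base width in bits

# Two N-bit networks versus one 2N-bit network (same aggregate width).
ratio = relative_area([N, N], baseline_width=2 * N)
print(f"two N-bit networks use {ratio:.2f}x the crossbar area of one 2N-bit network")
```

Under this simplified model, splitting one 2N-bit network into two N-bit networks halves the crossbar area at the same aggregate width. Buffers, links, and allocators are not modeled here.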
Build multi-network NoCs for Servers
But, Servers Rely on Cache Coherence
Server software needs shared memory
• Software stacks are complex
• Shared memory facilitates programming
• Enables portability across platforms
Coherence complicates NoC design
• Control and data-carrying messages
• Multiple message classes to enhance protocol performance
– Need to avoid protocol-level deadlocks
How to split messages across multiple networks?
Cache Coherence 101: Message Class & Size Glossary

| Protocol Message Class | Message | Network Message Size |
|---|---|---|
| Block fetch/evict requests | Read, write & upgrade | Short (~8 bytes) |
| Block fetch/evict requests | Evict dirty block | Long (~72 bytes) |
| Block fetch/evict requests | Evict clean | Short |
| Coherence requests | Downgrade, invalidate | Short |
| Responses | Response with data | Long |
| Responses | Acknowledgements | Short |
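To make the size difference concrete, the sketch below counts how many flits a short (~8-byte) and a long (~72-byte) message occupy on links of different widths, and what fraction of the transferred wire bits carry payload. It assumes one flit per link width per cycle and ignores any per-flit header overhead; the widths 64, 112, and 176 bits are the ones listed later in the Methodology slide, and the function names are illustrative.

```python
import math

# Approximate message sizes from the glossary, in bytes.
SHORT_BYTES = 8
LONG_BYTES = 72


def flits(message_bytes: int, link_width_bits: int) -> int:
    """Number of link-width flits needed to carry the message."""
    return math.ceil(message_bytes * 8 / link_width_bits)


def wire_utilization(message_bytes: int, link_width_bits: int) -> float:
    """Fraction of transferred wire bits that carry message payload."""
    used_bits = message_bytes * 8
    moved_bits = flits(message_bytes, link_width_bits) * link_width_bits
    return used_bits / moved_bits


for width in (64, 112, 176):
    for name, size in (("short", SHORT_BYTES), ("long", LONG_BYTES)):
        print(
            f"{width:3d}-bit link, {name:5s} message: "
            f"{flits(size, width)} flit(s), "
            f"utilization {wire_utilization(size, width):.0%}"
        )
```

A short message fills a 64-bit link completely but uses only about a third of a 176-bit link, which is the "poor utilization on short packets" effect of wide networks.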
Divide by class, size, or hybrid?
Divide Networks by Size?
Can specialize network width
• Reduce crossbar area & power
Still need VCs for message classes to avoid protocol-induced deadlocks
• Increase pipeline complexity
• Add to storage area & power
[Figure: a 3N-bit-wide network split into an N-bit-wide and a 2N-bit-wide network, one for short and one for long messages]
Network-wide VC overhead
Divide Networks by Class?
Can eliminate VCs
• Lower complexity and router delay
• Lower buffer requirements
But difficult to specialize network width: different message sizes within a class
Networks may be underutilized
• Variation in traffic across classes
• Suboptimal designs in cost or performance
[Figure: a 3N-bit-wide network split into three N-bit-wide networks]
Resource over-partitioning
Can Skewed Traffic Help Division?
Server workloads: most traffic fetches clean blocks from the last-level cache [Hardavellas’09, Ferdman’12]
• Instructions: high L1-I miss ratio, read-only
• Data: rarely modified (read mostly)
[Figure: message flows between L1-I, L1-D, and the last-level cache with its directory. Labels: read request, instruction request, read response, instruction response, evict clean. Legend: dirty/clean blocks, short/long messages]
Short requests & long responses dominant
Observation (1): Don’t Care About Long Requests!
Long requests: dirty block writebacks
• 10% on average; less frequent in server workloads
Writeback latency:
• Not on critical path
• Hidden through buffers & relaxed models
Network efficiency for writebacks not important
Observation (2): Short Responses are Rare in Servers!
Short responses: coherence ACK messages
• Instructions are read-only
– No core-to-core coherence traffic
• Data sharing happens beyond L1 residency
– Writers rarely modify shared data
– Core-to-core coherence traffic infrequent
Network efficiency for short responses not critical
Characterization of Network Traffic
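[Figure: breakdown of network traffic by message type, with callouts at 57% (short requests) and 34% (long responses)]
Servers exhibit bimodal network traffic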
Leveraging Bimodal Network Traffic
Recap: bimodal network traffic in servers
• Short requests (57%), long responses (34%)
• Short responses, coherence requests, long requests (9%)
CCNoC: dual-network NoC
• Wide response network
• Narrow request network
[Figure: request network and response network, labeled M-bit wide and N-bit wide]
CCNoC Response Network
• Wide datapath
– Optimized for long responses
• Wormhole flow control
– No virtual channels (only one class)
– Reduce cost & complexity
• Two-stage pipeline: XA, XT
[Figure: request network and response network, labeled M-bit wide and N-bit wide]
Wide response network: fast and low-cost
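A rough way to see why a shorter router pipeline matters is the textbook zero-load latency estimate: per-hop router and link delay, plus the cycles needed to serialize the remaining flits. The sketch below assumes one cycle per pipeline stage and one cycle per link traversal, and picks a 4-hop path for illustration; none of these parameters come from the talk. The 72-byte message size and 112-bit width are taken from other slides in this deck.

```python
import math


def zero_load_latency(hops: int, router_stages: int, message_bits: int,
                      width_bits: int, link_cycles: int = 1) -> int:
    """Cycles for a message to cross an idle network (no contention).

    Assumes one cycle per router pipeline stage (hypothetical timing).
    """
    head_latency = hops * (router_stages + link_cycles)
    serialization = math.ceil(message_bits / width_bits) - 1
    return head_latency + serialization


LONG_RESPONSE_BITS = 72 * 8  # ~72-byte data response
HOPS = 4                     # hypothetical path length

two_stage = zero_load_latency(HOPS, 2, LONG_RESPONSE_BITS, 112)
three_stage = zero_load_latency(HOPS, 3, LONG_RESPONSE_BITS, 112)
print(f"two-stage wormhole router: {two_stage} cycles")
print(f"three-stage VC router:     {three_stage} cycles")
```

With these assumed parameters, dropping the VC allocation stage saves one cycle per hop on every response.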
CCNoC Request Network
• Narrow datapath
– Optimized for short messages
– Reduce crossbar area & power
• Virtual channel (VC) flow control
– Avoid protocol-level deadlock between block fetch & coherence requests
• Standard VC-router pipeline
– Three stages: VA, XA, XT
[Figure: request network and response network, labeled M-bit wide and N-bit wide]
Narrow request network: VC cost is low
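The division described on the last two slides can be summarized as a small steering function: every request goes to the narrow network on a per-class virtual channel, and every response goes to the wide wormhole network. This is a minimal sketch, not the authors' implementation; the enum names and the specific VC numbering (0 for block fetch/evict requests, 1 for coherence requests) are assumptions made for illustration.

```python
from enum import Enum
from typing import Optional, Tuple


class MsgClass(Enum):
    BLOCK_REQUEST = "block fetch/evict request"
    COHERENCE_REQUEST = "coherence request"
    RESPONSE = "response"


# Hypothetical VC numbering on the request network: one VC per request class.
REQUEST_VC = {
    MsgClass.BLOCK_REQUEST: 0,
    MsgClass.COHERENCE_REQUEST: 1,
}


def select_network(msg_class: MsgClass) -> Tuple[str, Optional[int]]:
    """Return (network name, virtual channel), or None as VC for wormhole."""
    if msg_class is MsgClass.RESPONSE:
        # Single message class on this network, so no virtual channels.
        return ("response", None)
    return ("request", REQUEST_VC[msg_class])


for cls in MsgClass:
    print(cls.value, "->", select_network(cls))
```

Under this mapping, the rare long requests (dirty writebacks) cross the narrow network as multi-flit packets and the rare short responses (acknowledgements) underuse the wide network, which the two observations earlier in the deck argue is acceptable.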
Methodology
• Flexus [Wenisch’06]
– Full system simulation
– 16-core tiled CMP
– MESI protocol
• Server workloads
– OLTP, DSS, Web
• Custom power models
• Wide Mesh
– 176 bits, 3 VCs
• Homogeneous
– 2x 88 bits, 3 VCs/network
• Heterogeneous
– Short: 64 bits, 3 VCs
– Long: 112 bits, 3 VCs
• CCNoC
– Request: 64 bits, 2 VCs
– Response: 112 bits, WH (wormhole)
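The four NoC configurations above share the same aggregate datapath width and differ mainly in how many buffers each port needs. The sketch below tabulates them and computes a crude buffer-storage proxy, the sum of width times buffers per port. It assumes equal buffer depth per VC and counts the wormhole network as one buffer per port; both are simplifications for illustration and are not the custom power models used in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Network:
    width_bits: int
    buffers_per_port: int  # VC count; wormhole counted as 1 (assumption)


CONFIGS = {
    "Wide Mesh": [Network(176, 3)],
    "Homogeneous": [Network(88, 3), Network(88, 3)],
    "Heterogeneous": [Network(64, 3), Network(112, 3)],
    "CCNoC": [Network(64, 2), Network(112, 1)],
}


def total_width(networks) -> int:
    """Aggregate datapath width across all networks, in bits."""
    return sum(n.width_bits for n in networks)


def buffer_proxy(networks) -> int:
    """Width x buffers summed over networks: a rough stand-in for buffer bits."""
    return sum(n.width_bits * n.buffers_per_port for n in networks)


for name, networks in CONFIGS.items():
    print(f"{name:13s} width={total_width(networks):3d} bits, "
          f"buffer proxy={buffer_proxy(networks)}")
```

All four designs total 176 bits of datapath, so the comparison holds aggregate width constant; only CCNoC reduces the buffer proxy, because it needs fewer VCs on the request network and none on the response network.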
CCNoC Energy Efficiency
15-28% less energy
CCNoC Efficiency
[Figure: normalized system performance and power consumption (axes from 0.0 to 1.0); power broken down into Links (Dynamic), Buffers (Dynamic), Buffers (Leakage), Crossbar (Dynamic)]
• Up to 44% less buffer power
• Up to 40% less crossbar power
Significant power savings w/o performance loss
Conclusion
Bimodal network traffic in server workloads
• Short requests & long responses dominate
CCNoC: dual-network NoC for servers
• Narrow request and wide response networks
• Specialization of router microarchitectures
Compared to a homogeneous dual-network NoC
• 15% less energy
• No impact on performance
Thanks!
Questions?
For more information, http://parsa.epfl.ch/visa