Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks

Aniruddha N. Udipi

with Naveen Muralimanohar*, Rajeev Balasubramonian

University of Utah and *HP Labs

Motivation - I

• Future CMPs are likely to be power-limited
– On-chip networks consume 20-36% of total chip power
– Network power is dominated by routers

• Chip design and verification costs are tremendous
– Directory-based protocols are complicated and have the inherent problem of indirection
– Snooping-based protocols are well understood and simple to design

• Metal and wiring are cheap and plentiful

• We are no longer pin limited for the interconnection network


Motivation - II

• Future of multi-core computing is likely to diverge into two separate tracks
– Mid-range multicore machines for home/office
  • 16-64 cores
– Many-core machines for scientific/server applications
  • 1000s of cores

• Even machines with large core counts are likely to be virtualized, with communication localized to small chunks of approx. 64 cores

• Design energy-efficient networks for moderate core counts


Executive Summary

• Elimination of routers leads us back to bus-based networks

• Dramatic reduction in energy consumption, little or no loss in performance, reduction in design complexity

• Enhancing the life of buses for moderately sized CMPs
– Filtered segmented bus, low-swing wiring, address interleaved buses, page coloring


Outline

• Overview
• Proposal I - Filtered Segmented Bus
• Proposal II - Low-swing Wiring
• Proposal III - Address Interleaved Buses
• Proposal IV - Page Coloring
• Evaluation
• Conclusion

Baseline Chip and Interconnect Organization


[Figure: tile organization; labels: Core, L1, L2, Router]

• Simple mesh used for illustration here; other options are discussed in the paper

• Static-NUCA shared L2; each line has a “home” slice based on its address


Where does energy go in the network?

• Router: 1.39e-10 J/access
• Link: 1.56e-11 J/access
• Router vs. link: 8X

Energy estimates based on CACTI 6.0 and Orion 2.0



What is the solution?

• We are left with.. a bus!

• Could we really just use a bus?

• Not really
– Too many links activated on every transaction
– Energy gained by eliminating routers is lost by activating more links
– Poor performance due to increased arbitration times and network contention


We can do better..

Useless snoop: Particular cache line not present in any other core

• Segment and filter snoop transactions at intermediate points

• Two types of filters
– Out-filter
– In-filter

• Reduces number of links activated

• Allows for safe parallelism (serialization happens at the central bus if required)

[Figure: filtered bus; legend: bus link, filter]

Filters

• Each “filter” depicted in the figure is a combination of an “Out-filter” and an “In-filter”

• Each of these is a Counting Bloom Filter
– 2 arrays of 10-bit entries
– Subsets of the address bits are hashed into each of these arrays; entries are incremented to add a line and decremented to remove it
– To test for membership, simply check if entries in both arrays are non-zero
– Compact representation, false positives possible

[Figure legend: bus link, In + Out filter]
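The counting Bloom filter described on this slide can be sketched as follows. This is a minimal illustration, not the hardware design: the array size, the choice of which address bits index each array, and the clamping behaviour on counter overflow are all assumptions, since the slides only say that subsets of the address bits are hashed into two arrays of 10-bit entries.

```python
class CountingBloomFilter:
    """Two counter arrays; a line may be present only if both of its counters are non-zero."""

    COUNTER_MAX = (1 << 10) - 1  # 10-bit entries, as on the slide

    def __init__(self, entries_per_array: int = 4096):
        # entries_per_array is an illustrative assumption and must be a power of two
        assert entries_per_array & (entries_per_array - 1) == 0
        self.mask = entries_per_array - 1
        self.shift = entries_per_array.bit_length() - 1
        self.arrays = [[0] * entries_per_array for _ in range(2)]

    def _indices(self, line_addr: int):
        # Assumed hashing: array 0 uses the low address bits, array 1 the next group of bits
        return (line_addr & self.mask, (line_addr >> self.shift) & self.mask)

    def add(self, line_addr: int) -> None:
        for arr, i in zip(self.arrays, self._indices(line_addr)):
            arr[i] = min(arr[i] + 1, self.COUNTER_MAX)  # clamp; a real design needs an overflow policy

    def remove(self, line_addr: int) -> None:
        for arr, i in zip(self.arrays, self._indices(line_addr)):
            arr[i] = max(arr[i] - 1, 0)

    def may_contain(self, line_addr: int) -> bool:
        # False means definitely absent; True may be a false positive
        return all(arr[i] > 0 for arr, i in zip(self.arrays, self._indices(line_addr)))
```

With a 16-entry filter, adding lines 0x12 and 0x21 makes `may_contain(0x22)` return true even though 0x22 was never added: its two counters were each set by a different line. That is the false-positive case the slide mentions, and it costs only an unnecessary broadcast, never a missed one.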

Out-filter - Case 1


• Bloom filter in every segment keeps track of a superset of lines that call that segment “home” and have been sent “out” of that segment

• If a line has never left a segment, none of its transactions need to be seen outside

• Completely localized transaction

• Only home segment activated

[Figure: home segment, R marks the requested address; annotation: energy saved; legend: bus link, In-filter, Out-filter, activated bus, activated filter]

Out-filter – Case 2


• If the line is being requested from outside its home segment, the transaction has to go out on the central bus

• The out-filter of the home segment is updated appropriately

• The in-filter then takes over

[Figure: home segment with out-filter update, R marks the requested address; legend: bus link, In-filter, Out-filter, activated bus, activated filter]

In-filter


• Bloom filters keep track of a superset of lines currently present in the segment

• Only broadcast within the local segment if required

[Figure: R marks the requested address; annotation: energy saved; legend: bus link, In-filter, Out-filter, activated bus, activated filter]
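The out-filter and in-filter cases above can be put together in a small sketch that reports which sub-buses a snoop activates. It is a simplification under stated assumptions: exact Python sets stand in for the Bloom filters (so there are no false positives), evictions are not modelled, how the home L2 slice itself is reached is not modelled, and the names `Segment` and `snoop` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    out_filter: set = field(default_factory=set)  # home lines that have been sent out of this segment
    in_filter: set = field(default_factory=set)   # lines currently present in this segment


def snoop(segments, requester_seg, home_seg, line):
    """Return (sorted ids of activated segments, whether the central bus was used)."""
    home = segments[home_seg]
    if requester_seg == home_seg and line not in home.out_filter:
        # Out-filter case 1: the line never left its home segment, so stay local
        segments[requester_seg].in_filter.add(line)
        return [requester_seg], False

    # Out-filter case 2: the transaction goes out on the central bus
    if requester_seg != home_seg:
        home.out_filter.add(line)  # the line is now leaving its home segment

    # In-filter: broadcast only inside segments that may hold the line
    activated = {requester_seg}
    for seg_id, seg in enumerate(segments):
        if line in seg.in_filter:
            activated.add(seg_id)
    segments[requester_seg].in_filter.add(line)
    return sorted(activated), True
```

For example, with four segments, a request from segment 0 for a line homed in segment 0 activates only segment 0. A later request for the same line from segment 2 uses the central bus and activates segments 0 and 2, while segments 1 and 3 stay idle.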

Arbitration

• Global arbitration delay is non-trivial for a single bus connecting even 16 cores

• Multi-step arbitration, as required

• On every request
– arbitrate for local bus and broadcast
– if filter indicates that the transaction is complete, “validate” broadcast via wired-OR
– if not, arbitrate for central bus and hold broadcast in a single-entry buffer until the central bus is available
– at the remote sub-buses, priority is given to requests originating from the central bus




Low-swing Wiring

• Differential low-swing wiring is up to 10X more energy efficient than regular wiring

• It has less impact on packet-switched networks since routers are the bottleneck anyway
– Amdahl’s law!

• Slightly increased latency, more metal requirement
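A rough worked example of the Amdahl's-law point, using the per-access energies quoted earlier in the deck. Several things here are assumptions: 1.39e-10 J is read as the router figure and 1.56e-11 J as the link figure, a packet-switched hop is taken to cost one router access plus one link access, and the full 10X low-swing gain (an upper bound on the slide) is applied.

```python
ROUTER_J = 1.39e-10      # assumed to be the router energy per access
LINK_J = 1.56e-11        # assumed to be the link energy per access
LOW_SWING_GAIN = 10.0    # "up to 10X", so this is the best case


def hop_energy(router_j: float, link_j: float, link_gain: float = 1.0) -> float:
    """Energy of one hop when the link portion is made link_gain times cheaper."""
    return router_j + link_j / link_gain


# Packet-switched hop: router plus link
packet_before = hop_energy(ROUTER_J, LINK_J)
packet_after = hop_energy(ROUTER_J, LINK_J, LOW_SWING_GAIN)

# Bus link: no router in the path
bus_before = hop_energy(0.0, LINK_J)
bus_after = hop_energy(0.0, LINK_J, LOW_SWING_GAIN)

print(f"packet-switched hop improves by {packet_before / packet_after:.2f}x")
print(f"bus link improves by {bus_before / bus_after:.2f}x")
```

Under these assumptions the packet-switched hop improves by only about 10%, because the untouched router energy dominates, while the bus link gets the full benefit.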




Address Interleaved Buses

• As core counts increase, increased pressure on the bus due to contention

• At 64 cores, even though bus-based networks continue to be highly energy efficient, performance begins to dip

• To shore up performance, increase the number of buses

– different buses handle mutually exclusive addresses
– increased metal requirement
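One simple way to give each bus a mutually exclusive set of addresses is to interleave on the cache-line address. The sketch below assumes 64-byte lines and modulo interleaving on the line number; neither detail is stated on the slides, and `bus_for_address` is a hypothetical helper.

```python
LINE_BYTES = 64  # assumed cache-line size; not given in the slides


def bus_for_address(addr: int, num_buses: int) -> int:
    """Map a physical address to exactly one bus by interleaving on the line number."""
    line_number = addr // LINE_BYTES
    return line_number % num_buses
```

With two buses, consecutive cache lines alternate between bus 0 and bus 1, and all bytes of one line always map to the same bus, so snoops for a given line are still serialized on a single bus.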




Page Coloring

• OS-assisted page coloring for L2 cache

• We use a simple first-touch approach

• Improved locality helps any network, but is especially well-suited for our network because
– More flexibility in page placement
– Less negative impact from sub-optimal page placement
– Improves filter behavior
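A first-touch policy can be sketched as a map from page to colour that is filled in on the first access. The sketch assumes one L2 slice per core and that a page's colour directly names its home slice; the class name `FirstTouchColoring` and its method are hypothetical, and the 4KB page size is the one listed in the methodology.

```python
PAGE_BYTES = 4096  # 4KB pages


class FirstTouchColoring:
    """Assign each page to the L2 slice of the first core that touches it."""

    def __init__(self, num_slices: int):
        self.num_slices = num_slices
        self.page_color = {}  # page number -> colour (assumed to equal the home slice id)

    def home_slice(self, addr: int, core: int) -> int:
        page = addr // PAGE_BYTES
        # First touch fixes the colour; later touches by any core reuse it
        return self.page_color.setdefault(page, core % self.num_slices)
```

If core 3 touches a page first, every later access to that page, from any core, is homed at slice 3. Pages private to one core therefore stay within that core's segment, which is the behaviour the out-filter benefits from.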





Methodology

• Virtutech SIMICS full-system simulator
– “g-cache” significantly modified to add network models
• CACTI 6.0 and Orion 2.0 for router/link energy computation
• 16 cores for most experiments, sensitivity analysis for 32- and 64-core systems
• 32nm process, 3GHz clock
• 32K D-L1, 16K I-L1, 2MB/slice shared L2
• 200 cycle main memory latency
• 4KB page size
• PARSEC, NAS, SPLASH-2 benchmark suites – run for entire Region-Of-Interest/parallel section
• Baseline routers - 4 VCs, 8 buffers/VC

Energy Consumption – Address Network


• Ring – 20x
• Grid – 27x
• Fbfly – 31x

Energy Consumption – Data Network


• Ring – 2x
• Grid – 2.5x
• Fbfly – 3x

How does energy consumption reduce?

• Router : Link energy ratio is high enough to significantly impact energy characteristics

• Efficient Bloom filters, at 16KB/filter

– Out-filters are 85% accurate (note that there are only false positives, no false negatives)

– In-filters are 90% accurate


Effect of Page Coloring

• More locality

• Better filtering
– Out-filter accuracy increases from 85% to 97%


System Performance


• Ring – 7%
• Grid – 3%
• Fbfly – 1%

How does performance improve?

• Two basic reasons
– Inherent indirection in directory-based protocols
– Deep pipelines in routers increasing the no-load latency

• Avg. latency in bus-based network is 16.4 cycles
– Arbitration (3.7 cyc) + Contention (1 cyc) + Bloom filter (1.2 cyc) + Link latency (10.5 cyc)

• Even in the most connected FBFLY, there is an average of 1.5 hops per message and a bare minimum of two messages per transaction: 3 hops, or 15 cycles without contention
– Link (6 cyc) + Router (9 cyc)
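The latency figures above can be checked with a few lines of arithmetic. All numbers are taken from the slide; only the variable names are introduced here.

```python
# Bus-based network: average latency components in cycles
bus_components = {
    "arbitration": 3.7,
    "contention": 1.0,
    "bloom_filter": 1.2,
    "link": 10.5,
}
bus_latency = sum(bus_components.values())

# Flattened butterfly: no-load latency of a minimal directory transaction
hops_per_message = 1.5
messages_per_transaction = 2
fbfly_hops = hops_per_message * messages_per_transaction
fbfly_link_cycles = 6
fbfly_router_cycles = 9
fbfly_latency = fbfly_link_cycles + fbfly_router_cycles

print(f"bus: {bus_latency:.1f} cycles")
print(f"fbfly: {fbfly_hops:.0f} hops, {fbfly_latency} cycles without contention")
```

The bus total of 16.4 cycles already includes arbitration and contention, while the 15-cycle flattened butterfly figure is a contention-free lower bound, which is why the two networks end up close in performance.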


Scaling – 32 Cores – Energy

Average energy reduction of 19X in address network, 3X in data network


32 Cores – Performance

Average 5% drop in performance


Scaling – 64 Cores – Energy

Average reduction of 13X in address network, 2.5X in data network


64 Cores – Performance


Average 39% increase in execution time compared to fbfly, only 12% increase with just two interleaved buses

Router Optimizations


• For packet-switched networks to be as energy efficient as bus-based networks, the Router : Link energy ratio should be less than
– 3.5X at 16 cores
– 4.5X at 32 cores
– 7X at 64 cores

• Current energy ratio is approx. 70X
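To put the break-even ratios in perspective, the following computes how much router energy would have to shrink to reach them. It assumes link energy stays fixed, so the whole improvement has to come from the router; the ratios themselves are the ones on the slide.

```python
CURRENT_RATIO = 70.0  # approximate current Router : Link energy ratio
BREAK_EVEN_RATIO = {16: 3.5, 32: 4.5, 64: 7.0}  # cores -> ratio at which packet switching matches the bus

# Factor by which router energy must drop, holding link energy constant (assumption)
required_reduction = {
    cores: CURRENT_RATIO / ratio for cores, ratio in BREAK_EVEN_RATIO.items()
}

for cores, factor in required_reduction.items():
    print(f"{cores} cores: router energy must drop by about {factor:.1f}x")
```

That works out to a 20x reduction at 16 cores, roughly 15.6x at 32 cores, and 10x at 64 cores.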




Related Work

• Packet Switched Networks
– Dally/Towles (DAC ’01), Kim et al. (MICRO ’07), Grot et al. (HPCA ’09), TRIPS, TILERA

• Hierarchical Networks
– Muralimanohar et al. (ISCA ’07), Das et al. (HPCA ’09)

• Snoop Filtering
– Moshovos et al. (HPCA ’01), Strauss et al. (ISCA ’06), Salapura et al. (HPCA ’08)

• Bus applications in CMPs
– Manevich et al. (NOCS ’09)

Key Contributions

• For moderate core counts, buses just work!
– Dramatic energy reduction
– Little or no loss in performance
– Simple snooping protocols, reduction in design complexity

• Low-swing wiring

• Multiple address interleaved buses

• OS-assisted page coloring

• Potential for router optimization



Thank you..

• Questions?
