Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks

Aniruddha N. Udipi with Naveen Muralimanohar*, Rajeev Balasubramonian, University of Utah and *HP Labs


Page 1: Towards Scalable, Energy-Efficient,       Bus-Based On-Chip Networks

Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks

Aniruddha N. Udipi

with Naveen Muralimanohar*, Rajeev Balasubramonian

University of Utah and *HP Labs

Page 2: Towards Scalable, Energy-Efficient,       Bus-Based On-Chip Networks

Motivation - I

• Future CMPs are likely to be power-limited

– On-chip networks consume 20-36% of total chip power

– Network power is dominated by routers

• Chip design and verification costs are tremendous

– Directory-based protocols are complicated and have the inherent problem of indirection

– Snooping-based protocols are well understood and simple to design

• Metal and wiring are cheap and plentiful

• We are no longer pin limited for the interconnection network

Page 3: Towards Scalable, Energy-Efficient,       Bus-Based On-Chip Networks

Motivation - II

• Future of multi-core computing likely to diverge into two separate tracks

– Mid-range multicore machines for home/office

• 16-64 cores

– Many-core machines for scientific/server applications

• 1000s of cores

• Even machines with large core counts are likely to be virtualized, with communication localized to small chunks of approx. 64 cores

• Design energy-efficient networks for moderate core-counts

Page 4: Towards Scalable, Energy-Efficient,       Bus-Based On-Chip Networks

Executive Summary

• Elimination of routers leads us back to bus-based networks

• Dramatic reduction in energy consumption, little or no loss in performance, reduction in design complexity

• Enhancing the life of buses for moderately sized CMPs

– Filtered segmented bus, low-swing wiring, address interleaved buses, page coloring

Page 5: Towards Scalable, Energy-Efficient,       Bus-Based On-Chip Networks

Outline

• Overview

• Proposal I - Filtered Segmented Bus

• Proposal II - Low-swing Wiring

• Proposal III - Address Interleaved Buses

• Proposal IV - Page Coloring

• Evaluation

• Conclusion

Page 6: Towards Scalable, Energy-Efficient,       Bus-Based On-Chip Networks

Baseline Chip and Interconnect Organization

• Simple mesh used for illustration here; other options are discussed in the paper

• Static-NUCA shared L2; each line has a “home” slice based on its address

[Figure: baseline chip organization. Labels: Core, L1, L2, Router]

Page 7: Towards Scalable, Energy-Efficient,       Bus-Based On-Chip Networks

Where does energy go in the network?

[Figure: energy per access. Router: 1.39e-10 J/access; Link: 1.56e-11 J/access; annotated difference: 8X]

Energy estimates based on CACTI 6.0 and Orion 2.0

Page 9: Towards Scalable, Energy-Efficient,       Bus-Based On-Chip Networks

What is the solution?

• We are left with... a bus!

• Could we really just use a bus?

• Not really

– Too many links activated on every transaction

– Energy gained by eliminating routers is lost by activating more links

– Poor performance due to increased arbitration times and network contention

Page 10: Towards Scalable, Energy-Efficient,       Bus-Based On-Chip Networks

We can do better..

Useless snoop: Particular cache line not present in any other core

Page 11: Towards Scalable, Energy-Efficient,       Bus-Based On-Chip Networks

Filtered Bus

• Segment and filter snoop transactions at intermediate points

• Two types of filters

– Out-filter

– In-filter

• Reduces number of links activated

• Allows for safe parallelism (serialization happens at the central bus if required)

[Figure legend: Bus link, Filter]

Page 12: Towards Scalable, Energy-Efficient,       Bus-Based On-Chip Networks

Filters

• Each “filter” depicted in the figure is a combination of an “Out-filter” and an “In-filter”

• Each of these is a Counting Bloom Filter

– 2 arrays of 10-bit entries

– Subsets of the address bits are hashed into each of these arrays; entries are incremented to add and decremented to remove

– To test for membership, simply check if entries in both arrays are non-zero

– Compact representation, false positives possible

[Figure legend: Bus link, In + Out Filter]
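The counting Bloom filter described on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' hardware design: the array length (`index_bits`), the 64-byte line size (`line_offset_bits`), and the choice of two adjacent bit-slices of the line address as the "subsets of the address bits" are all assumptions. Only the two-array structure, the 10-bit entries, and the add/remove/test rules come from the slide.

```python
class CountingBloomFilter:
    """Two counter arrays, each indexed by a different slice of the address."""

    def __init__(self, index_bits=10, counter_bits=10, line_offset_bits=6):
        self.index_bits = index_bits                # assumed: 1024 entries per array
        self.size = 1 << index_bits
        self.max_count = (1 << counter_bits) - 1    # 10-bit entries hold 0..1023
        self.line_offset_bits = line_offset_bits    # assumed: 64-byte cache lines
        self.arrays = [[0] * self.size, [0] * self.size]

    def _indices(self, address):
        # Hypothetical hash: low and next-higher slices of the line address.
        line = address >> self.line_offset_bits
        mask = self.size - 1
        return (line & mask, (line >> self.index_bits) & mask)

    def add(self, address):
        for array, idx in zip(self.arrays, self._indices(address)):
            if array[idx] == self.max_count:
                # Silently saturating could later cause false negatives.
                raise OverflowError("counter would overflow")
            array[idx] += 1

    def remove(self, address):
        # Only call for addresses that were previously added.
        for array, idx in zip(self.arrays, self._indices(address)):
            array[idx] -= 1

    def may_contain(self, address):
        # Member only if the entries in both arrays are non-zero.
        return all(array[idx] > 0
                   for array, idx in zip(self.arrays, self._indices(address)))
```

Because unrelated addresses can share an entry in each array, `may_contain` can answer True for a line that was never added (a false positive), but it never answers False for a line that is still present.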

Page 13: Towards Scalable, Energy-Efficient,       Bus-Based On-Chip Networks

Out-filter - Case 1

• Bloom filter in every segment keeps track of a superset of lines that call that segment “home” and have been sent “out” of that segment

• If a line has never left a segment, none of its transactions need to be seen outside

• Completely localized transaction

• Only home segment activated

[Figure labels: R, Home Segment, Energy Saved. Legend: Bus link, In-Filter, Out-Filter, Activated bus, Activated filter; R = Requested Address]

Page 14: Towards Scalable, Energy-Efficient,       Bus-Based On-Chip Networks

Out-filter – Case 2

• If the line is being requested from outside its home segment, the transaction has to go out on the central bus

• The out-filter of the home segment is updated appropriately

• The in-filter then takes over

[Figure labels: Home Segment, R, Update. Legend: Bus link, In-Filter, Out-Filter, Activated bus, Activated filter; R = Requested Address]

Page 15: Towards Scalable, Energy-Efficient,       Bus-Based On-Chip Networks

In-filter

• Bloom filters keep track of a superset of lines currently present in the segment

• Only broadcast within the local segment if required

[Figure labels: R, Energy Saved. Legend: Bus link, In-Filter, Out-Filter, Activated bus, Activated filter; R = Requested Address]
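Taken together, the out-filter and in-filter slides suggest a routing decision along the following lines. This is a hedged sketch rather than the authors' logic: `route_snoop` and its arguments are hypothetical names, plain Python sets stand in for the Bloom filters, and the sketch assumes that a segment's in-filter also covers lines held in that segment's L2 slices. It does not model updating the requester's in-filter once the line arrives.

```python
def route_snoop(address, requester_seg, home_seg, out_filters, in_filters):
    """Return (uses_central_bus, segments whose local bus is activated).

    out_filters[s] and in_filters[s] are plain sets standing in for the
    Bloom filters of segment s (an assumption made to keep the sketch short).
    """
    active = {requester_seg}  # the request is always broadcast locally first

    # Out-filter, case 1: a request made inside the home segment for a line
    # that has never left it stays completely local.
    if requester_seg == home_seg and address not in out_filters[home_seg]:
        return False, active

    # Out-filter, case 2: the transaction goes out on the central bus, and
    # the home segment records that the line has been sent "out".
    if requester_seg != home_seg:
        out_filters[home_seg].add(address)

    # In-filter: a remote segment rebroadcasts only if it may hold the line.
    for seg, present in enumerate(in_filters):
        if seg != requester_seg and address in present:
            active.add(seg)
    return True, active
```

With two segments and empty filters, a request from segment 0 for a line homed in segment 0 activates only segment 0 and never touches the central bus, which is the "completely localized transaction" case.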

Page 16: Towards Scalable, Energy-Efficient,       Bus-Based On-Chip Networks

Arbitration

• Global arbitration delay is non-trivial for a single bus connecting even 16 cores

• Multi-step arbitration, as required

• On every request

– arbitrate for local bus and broadcast

– if filter indicates that the transaction is complete, “validate” broadcast via wired-OR

– if not, arbitrate for central bus and hold broadcast in a single-entry buffer until the central bus is available

– at the remote sub-buses, priority is given to requests originating from the central bus
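The per-request sequence on this slide can be written out as a small sketch. The function names and step strings are illustrative only; the ordering of steps and the priority rule at remote sub-buses follow the bullets above, while timing, the wired-OR signal, and buffer occupancy are not modeled.

```python
from collections import deque


def request_steps(completes_locally):
    """List the arbitration steps a request goes through, in order."""
    steps = ["arbitrate for local bus", "broadcast on local bus"]
    if completes_locally:
        # The filter indicates nothing outside the segment needs to see it.
        steps.append("validate broadcast via wired-OR")
    else:
        steps.append("arbitrate for central bus")
        steps.append("hold broadcast in single-entry buffer")
        steps.append("broadcast on central bus")
    return steps


def pick_next(from_central, from_local):
    """Remote sub-bus arbiter: central-bus requests win over local ones."""
    if from_central:
        return from_central.popleft()
    if from_local:
        return from_local.popleft()
    return None
```

For example, `pick_next(deque(["c"]), deque(["l"]))` returns `"c"`, reflecting the priority given to requests arriving from the central bus.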

Page 18: Towards Scalable, Energy-Efficient,       Bus-Based On-Chip Networks

Low-swing Wiring

• Differential low-swing wiring is up to 10X more energy efficient than regular wiring

• It has less impact on packet-switched networks, since routers are the bottleneck anyway

– Amdahl’s law!

• Slightly increased latency, more metal requirement
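The Amdahl's law point can be illustrated with the per-access energies quoted on the "Where does energy go in the network?" slide. The sketch below assumes that 1.39e-10 J is the router figure and 1.56e-11 J the link figure, that a packet-switched access pays for one router and one link, and that low-swing wiring delivers its full 10X on the link portion; none of these pairings is stated explicitly in the deck.

```python
ROUTER_J = 1.39e-10  # assumed router energy per access (from the earlier slide)
LINK_J = 1.56e-11    # assumed link energy per access (from the earlier slide)


def fraction_saved(router_j, link_j, link_gain=10.0):
    """Fraction of network energy saved when only link energy improves."""
    before = router_j + link_j
    after = router_j + link_j / link_gain
    return 1.0 - after / before


# Packet-switched: the router term is untouched, so the saving is small.
packet_switched = fraction_saved(ROUTER_J, LINK_J)

# Router-less bus: links are the whole budget, so the full gain applies.
bus_links_only = fraction_saved(0.0, LINK_J)
```

Under these assumptions the packet-switched network saves roughly 9% of its energy, while a bus made only of links saves 90%.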

Page 20: Towards Scalable, Energy-Efficient,       Bus-Based On-Chip Networks

Address Interleaved Buses

• As core counts increase, increased pressure on the bus due to contention

• At 64 cores, even though bus-based networks continue to be highly energy efficient, performance begins to dip

• To shore up performance, increase the number of buses

– different buses handle mutually exclusive addresses

– increased metal requirement
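One simple way to give each bus a mutually exclusive set of addresses is to interleave on low-order bits of the cache-line address. The sketch below assumes 64-byte lines and modulo interleaving; the slides do not say which address bits the design actually uses, and `select_bus` is a hypothetical name.

```python
LINE_OFFSET_BITS = 6  # assumed 64-byte cache lines


def select_bus(address, num_buses):
    """Map an address to exactly one bus, so the buses never share a line."""
    line = address >> LINE_OFFSET_BITS
    return line % num_buses
```

With two buses, consecutive cache lines alternate between bus 0 and bus 1, and every byte of a given line maps to the same bus.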

Page 22: Towards Scalable, Energy-Efficient,       Bus-Based On-Chip Networks

Page Coloring

• OS-assisted page-coloring for L2 cache

• We use a simple first-touch approach

• Improved locality helps any network, but is especially well-suited for our network because

– More flexibility in page placement

– Less negative impact by sub-optimal page placement

– Improves filter behavior
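A first-touch policy can be sketched as a table that pins each page to the L2 slice of whichever core touches it first. The class below is an illustration only: it assumes one L2 slice per core and uses the 4KB page size listed under Methodology; the actual OS mechanism (choosing a physical page whose color maps to that slice) is not modeled.

```python
PAGE_SIZE = 4096  # 4KB pages, as listed in the methodology


class FirstTouchColoring:
    """Remember, per page, the slice of the first core that touched it."""

    def __init__(self):
        self.page_home = {}  # page number -> home L2 slice (assumed = core id)

    def home_slice(self, address, core_id):
        page = address // PAGE_SIZE
        # The first toucher decides the home; later touches reuse it.
        return self.page_home.setdefault(page, core_id)
```

A page first touched by core 3 stays homed at slice 3 even when core 5 accesses it later, which keeps most of core 3's traffic inside its own segment.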

Page 24: Towards Scalable, Energy-Efficient,       Bus-Based On-Chip Networks

Methodology

• Virtutech SIMICS full-system simulator

– “g-cache” significantly modified to add network models

• CACTI 6.0 and Orion 2.0 for router/link energy computation

• 16 cores for most experiments, sensitivity analysis for 32- and 64-core systems

• 32nm process, 3GHz clock

• 32K D-L1, 16K I-L1, 2MB/slice shared L2

• 200 cycle main memory latency

• 4KB page size

• PARSEC, NAS, SPLASH-2 benchmark suites, run for the entire Region-Of-Interest/parallel section

• Baseline routers - 4 VCs, 8 buffers/VC

Page 25: Towards Scalable, Energy-Efficient,       Bus-Based On-Chip Networks

Energy Consumption – Address Network

[Figure: address-network energy consumption. Annotations: Ring: 20x, Grid: 27x, Fbfly: 31x]

Page 26: Towards Scalable, Energy-Efficient,       Bus-Based On-Chip Networks

Energy Consumption – Data Network

[Figure: data-network energy consumption. Annotations: Ring: 2x, Grid: 2.5x, Fbfly: 3x]

Page 27: Towards Scalable, Energy-Efficient,       Bus-Based On-Chip Networks

How does energy consumption reduce?

• Router : Link energy ratio is high enough to significantly impact energy characteristics

• Efficient bloom filters, at 16KB/filter

– Out-filters are 85% accurate (note that there are only false positives, no false negatives)

– In-filters are 90% accurate

Page 28: Towards Scalable, Energy-Efficient,       Bus-Based On-Chip Networks

Effect of Page Coloring

• More locality

• Better filtering

– Out-filter accuracy increases from 85% to 97%

Page 29: Towards Scalable, Energy-Efficient,       Bus-Based On-Chip Networks

System Performance

[Figure: system performance. Annotations: Ring: 7%, Grid: 3%, Fbfly: 1%]

Page 30: Towards Scalable, Energy-Efficient,       Bus-Based On-Chip Networks

How does performance improve?

• Two basic reasons

– Inherent indirection in directory-based protocols

– Deep pipelines in routers increasing the no-load latency

• Avg. latency in bus-based network is 16.4 cycles

– Arbitration (3.7 cyc) + Contention (1 cyc) + Bloom filter (1.2 cyc) + Link latency (10.5 cyc)

• Even in the most connected FBFLY, with an average of 1.5 hops per message and a bare minimum of two messages per transaction: 3 hops, 15 cycles without contention

– Link (6 cyc) + Router (9 cyc)
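The latency figures on this slide can be checked with a few lines of arithmetic. The dictionary keys are just labels chosen for this sketch; all numbers are taken from the bullets above.

```python
# Average bus-based latency, broken into the components quoted on the slide.
bus_cycles = {
    "arbitration": 3.7,
    "contention": 1.0,
    "bloom_filter": 1.2,
    "link": 10.5,
}
bus_total = sum(bus_cycles.values())

# Flattened butterfly: 1.5 hops per message, at least two messages.
fbfly_hops = 1.5 * 2

# Contention-free latency for those hops, as quoted on the slide.
fbfly_cycles = {"link": 6, "router": 9}
fbfly_total = sum(fbfly_cycles.values())
```

The bus components add up to the quoted 16.4 cycles, and the flattened butterfly's best case is 3 hops at 15 cycles before any contention is added.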

Page 31: Towards Scalable, Energy-Efficient,       Bus-Based On-Chip Networks

Scaling – 32 Cores – Energy

Average energy reduction of 19X in address network, 3X in data network

Page 32: Towards Scalable, Energy-Efficient,       Bus-Based On-Chip Networks

32 Cores – Performance

Average 5% drop in performance

Page 33: Towards Scalable, Energy-Efficient,       Bus-Based On-Chip Networks

Scaling - 64 Cores – Energy

Average reduction of 13X in address network, 2.5X in data network

Page 34: Towards Scalable, Energy-Efficient,       Bus-Based On-Chip Networks

64 Core - Performance

Average 39% increase in execution time compared to fbfly, only 12% increase with just two interleaved buses

Page 35: Towards Scalable, Energy-Efficient,       Bus-Based On-Chip Networks

Router Optimizations

• For packet-switched networks to be as energy efficient as bus-based networks, Router : Link energy ratio should be less than

– 3.5X at 16 cores

– 4.5X at 32 cores

– 7X at 64 cores

• Current energy ratio is approx. 70X
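To put the gap in perspective, the break-even ratios can be compared with the current ratio of roughly 70X. The sketch assumes link energy is held fixed, so the whole improvement would have to come from the router; the variable names are illustrative.

```python
CURRENT_RATIO = 70.0                      # approximate Router : Link ratio today
BREAK_EVEN = {16: 3.5, 32: 4.5, 64: 7.0}  # cores -> break-even ratio

# Factor by which router energy would need to shrink at each core count.
required_reduction = {
    cores: CURRENT_RATIO / limit for cores, limit in BREAK_EVEN.items()
}
```

Under that assumption, routers would need to become about 20X more efficient at 16 cores, about 15.6X at 32 cores, and 10X at 64 cores to match the bus.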

Page 37: Towards Scalable, Energy-Efficient,       Bus-Based On-Chip Networks

Related Work

• Packet Switched Networks

– Dally/Towles (DAC ’01), Kim et al. (MICRO ’07), Grot et al. (HPCA ’09), TRIPS, TILERA

• Hierarchical Networks

– Muralimanohar et al. (ISCA ’07), Das et al. (HPCA ’09)

• Snoop Filtering

– Moshovos et al. (HPCA ’01), Strauss et al. (ISCA ’06), Salapura et al. (HPCA ’08)

• Bus applications in CMPs

– Manevich et al. (NOCS ’09)

Page 38: Towards Scalable, Energy-Efficient,       Bus-Based On-Chip Networks

Key Contributions

• For moderate core counts, buses just work!

– Dramatic energy reduction

– Little or no loss in performance

– Simple snooping protocols, reduction in design complexity

• Low-swing wiring

• Multiple Address Interleaved buses

• OS-assisted page coloring

• Potential for router optimization

Page 39: Towards Scalable, Energy-Efficient,       Bus-Based On-Chip Networks

Thank you..

• Questions?