MINIMISING DYNAMIC POWER CONSUMPTION IN ON-CHIP NETWORKS

Robert Mullins

Computer Architecture Group

Computer Laboratory

University of Cambridge, UK

2/19

Communication-Centric Architectures

• Future performance gains will primarily come from increasing the number of IP cores in a system, not their complexity or operating frequency

• Many reasons:
  – Diminishing returns from simply scaling what we have
  – Energy efficiency
  – Complexity
  – Fault tolerance
  – Economics

3/19

On-Chip Networks

• An efficient general purpose chip-wide communication infrastructure is becoming essential

• One flexible networking option is to use packet-switched networks with support for virtual-channels

4/19

The Lochside Router

• Router Architecture
  – Highly parameterised implementation
  – Packet-switched network with virtual-channel flow-control
  – Best case latency is one cycle per network hop

• Results presented here are from post-P&R simulations targeting a 90nm technology

[Figure: Lochside Chip (2004/05), 180nm technology. Each tile contains a router (R) and a traffic generator with debug and test logic.]

5/19

Exploiting Speculation to Reduce Communication Latency

Peh/Dally (2001)

6/19

Exploiting Speculation to Reduce Communication Latency

7/19

Aims of this work

• Apply existing power saving techniques to an on-chip network design
  – e.g. clock and signal gating, gate-level optimisations etc.
  – Importance of applying such techniques before making comparisons

• Measure power consumption and provide an accurate breakdown of where the remaining power is dissipated

• Where is the best place to look for future power savings?

8/19

Measuring and Optimizing Dynamic Power

• Our Test Case
  – 8mm x 8mm die
  – 4x4 mesh network
  – Low-latency routers, best case latency is one cycle per hop (incl. interconnect)
  – 1.2V, 90nm technology
  – 4 input-buffers/VC
  – 4 VC/input port
  – 48 x 80-bit network links
  – 800MHz @ WC PVT
    • ~32 FO4 clock period
  – Results reported at 250MHz
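The test-case figures can be cross-checked with a little arithmetic. The sketch below assumes that the 48 links are unidirectional router-to-router links (one per direction between neighbouring routers) and that each input buffer holds one 80-bit flit. Neither assumption is stated on the slide. The function name `mesh_links` is hypothetical.

```python
def mesh_links(rows: int, cols: int) -> int:
    """Count unidirectional router-to-router links in a rows x cols mesh."""
    horizontal_pairs = rows * (cols - 1)   # neighbouring pairs along each row
    vertical_pairs = cols * (rows - 1)     # neighbouring pairs along each column
    return 2 * (horizontal_pairs + vertical_pairs)  # one link per direction

links = mesh_links(4, 4)                 # matches the 48 links on the slide
link_wires = links * 80                  # data wires only, 80 bits per link

# Assumption: one 80-bit flit per input buffer.
buffer_bits_per_port = 4 * 4 * 80        # 4 buffers/VC x 4 VCs/port x 80 bits

period_ps = 1e6 / 800                    # 800MHz clock period in picoseconds
fo4_ps = period_ps / 32                  # implied FO4 delay at worst-case PVT

print(links, link_wires, buffer_bits_per_port, period_ps, fo4_ps)
```

Under these assumptions a 4x4 mesh gives exactly 48 links, and a clock period of ~32 FO4 at 800MHz implies an FO4 delay of roughly 39ps at the worst-case corner.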

9/19

Interconnect Delay/Energy Trade-offs

• Power dissipated in network links depends on how links are spaced and buffered

• At least a factor of 3 difference in energy consumption over range of potential interconnect options

• Could move to low-swing differential schemes for even greater energy savings

For the results we assume minimum-spaced wires with optimal energy x delay product
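Link power follows the usual dynamic power relation P = α·C·V²·f, so spacing and buffering matter through the switched capacitance C. The sketch below uses the 1.2V supply, 250MHz reporting frequency and 80-bit link width from the test case. The per-wire capacitance and the activity factor are placeholder values chosen only for illustration. They are not measurements from this work.

```python
def link_dynamic_power_w(bits: int, cap_per_wire_f: float, vdd: float,
                         freq_hz: float, activity: float) -> float:
    """Dynamic power of one link: activity * total capacitance * Vdd^2 * f.

    activity is the average number of power-consuming (0 -> 1) transitions
    per wire per clock cycle.
    """
    return activity * bits * cap_per_wire_f * vdd ** 2 * freq_hz

# PLACEHOLDER values, not from the slides: 0.4pF per wire, activity 0.25.
baseline = link_dynamic_power_w(bits=80, cap_per_wire_f=0.4e-12,
                                vdd=1.2, freq_hz=250e6, activity=0.25)

# A link option with three times the switched capacitance per wire
# dissipates three times the power at the same activity and frequency.
heavier = link_dynamic_power_w(bits=80, cap_per_wire_f=1.2e-12,
                               vdd=1.2, freq_hz=250e6, activity=0.25)

print(f"{baseline * 1e3:.2f} mW vs {heavier * 1e3:.2f} mW")
```

Because power is linear in C, a factor of 3 spread in energy between interconnect options corresponds to a factor of 3 spread in effective switched capacitance at a fixed supply voltage.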

10/19

Clock Gating

• Clock gating optimisations applied at two levels:
  – Local Clock Gating
    • Automated clock gating within router
    • Some tuning of RTL involved to maximise opportunities for synthesis tool
  – Router Level Clock Gating
    • Exploit opportunities to gate clock as it enters the router
    • Isolates router’s clock completely, only static power consumption remains

11/19

Router-Level Clock Gating

• Clock gating exposes clock tree insertion delay

• Need to know early if router will be required

• Generate ‘early valid’ signals in neighbouring routers
  – Early-valid signals are slightly pessimistic
  – Based on what is requested not granted
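The early-valid idea can be sketched behaviourally. This is a minimal illustrative model, not the Lochside RTL: the names `Cycle`, `early_valid` and `clock_enable` are hypothetical, and the model ignores the exact cycle in which the signal must arrive to hide the clock tree insertion delay.

```python
from typing import List, NamedTuple

class Cycle(NamedTuple):
    requests: List[bool]  # per neighbour: requested an output port facing this router
    grants: List[bool]    # per neighbour: that request won arbitration

def early_valid(cycle: Cycle) -> bool:
    """Request-based early-valid: asserted before arbitration completes."""
    return any(cycle.requests)

def clock_enable(cycle: Cycle, has_buffered_flits: bool) -> bool:
    """Run the router clock if it holds work or a neighbour may send some."""
    return has_buffered_flits or early_valid(cycle)

trace = [
    Cycle([False, False], [False, False]),  # idle: clock stays gated
    Cycle([True, False], [True, False]),    # request granted: a flit arrives
    Cycle([False, True], [False, False]),   # request lost arbitration: false wake
    Cycle([False, False], [False, False]),  # idle again
]

enabled = [clock_enable(c, has_buffered_flits=False) for c in trace]
needed = [any(c.grants) for c in trace]
false_wakes = sum(e and not n for e, n in zip(enabled, needed))

print(enabled, needed, false_wakes)
```

The third cycle shows the pessimism: the router is woken because a neighbour requested a port, even though that request was not granted and no flit arrives.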

12/19

Gate-Level Optimizations and Signal Gating

• Automated signal gating and gate-level power optimisations had minimal impact

• Inserting signal gating logic manually did reduce input FIFO power requirements significantly

• The reported results could be further improved (by 12%) by enabling logic optimisation across module boundaries
  – Cross-boundary optimisation was disabled so that power could be attributed accurately to each module

13/19

Analysis of Power Consumption

• Simple power optimisations can cut power requirements to a quarter, and many more opportunities to save power remain

• Network is ~5% of core area

• Perhaps 10% of system power at present

• Don’t make comparisons without optimizing power!

[Figure: Power consumption of a single router and its links]

14/19

Analysis of Power Consumption

• 22% Static power, 11% Inter-Router Links

• ~1% Global Clock tree

• 65% Dynamic Power
  – Power Breakdown
    • ~50% of dynamic power is consumed in local clock tree and input FIFOs
    • ~30% on router datapath
    • ~20% on scheduling and arbitration
      – Scheduling is probably more complex than typical implementations due to speculation
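The two-level breakdown can be flattened into shares of total power. The sketch below assumes that the 65% figure is router dynamic power excluding the links and global clock tree, which are listed separately. All percentages are the approximate values from the slide.

```python
# Top-level shares of total power (percent), as listed on the slide.
total = {
    "static": 22,
    "inter_router_links": 11,
    "global_clock_tree": 1,
    "router_dynamic": 65,
}

# Split of the router dynamic power (percent of the 65%).
dynamic_split = {
    "local clock tree + input FIFOs": 50,
    "router datapath": 30,
    "scheduling and arbitration": 20,
}

# Express each dynamic component as a percentage of total power.
share_of_total = {
    name: total["router_dynamic"] * pct / 100
    for name, pct in dynamic_split.items()
}

for name, pct in share_of_total.items():
    print(f"{name}: {pct:.1f}% of total power")
print(f"top-level shares sum to {sum(total.values())}% (figures are rounded)")
```

On this reading the local clock tree and input FIFOs account for roughly a third of total power, which is more than static power (22%) or the inter-router links (11%).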

15/19

Low-Power On-Chip Networks

• Interconnect and static power set to increase
  – Many low-power link technologies
    • Low-swing differential techniques
  – Power gating and other leakage reduction techniques

• Potential power savings begin to require lots of different techniques: no one silver bullet?

16/19

Low-Power On-Chip Networks

• Topology
  – Don’t want to sacrifice general or at least multi-purpose nature of our networked SoC
  – Results suggest higher radix routers and longer interconnects could reduce power
    • Probably not a long term solution
    • Reduces path diversity, bad for fault-tolerance

• Architecture
  – Scope for minimising memory required to store precomputed router schedule (particular to our router)
  – Simpler routers
  – Single cycle routers reduce power? Speculation for low-power?

17/19

Supporting Best-Effort (BE) and Guaranteed Services (GS) Efficiently

• Current timing of the datapath and link suggests additional GS data could be routed in the same clock cycle
  – Allocate datapath/link to GS traffic for first ½ of clock cycle
    • Double capacity of network
  – Exploit simpler GS circuit-switched routing when possible
  – Reduce power

• Very little additional overhead

18/19

Clocking On-Chip Networks

• Network system timing issues are interesting
  – Naturally event-driven, not synchronous

• Work is investigating placing local data-driven clock generators in each network router
  – Clock is stretched when no data to be routed
  – Clock matches rate of incoming data streams
  – Robust synchronisation solution (true GALS)
  – Also investigating incorporating power gating support

• See also Distributed Clock Generator – DCG (Fairbanks/Moore)
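The effect of a data-driven local clock can be illustrated with a toy model. This is a sketch under simplifying assumptions, not the clock generator under investigation: time is divided into slots, and the hypothetical function `local_clock_pulses` produces a pulse only in slots where data is waiting.

```python
from typing import List

def local_clock_pulses(arrivals: List[int]) -> List[bool]:
    """Return, per time slot, whether the local clock generator fires.

    arrivals[i] is the number of flits presented at the router's inputs in
    slot i. With no data the clock is stretched, so no pulse is produced.
    """
    return [n > 0 for n in arrivals]

arrivals = [0, 2, 1, 0, 0, 0, 1, 0]       # example traffic, chosen arbitrarily
pulses = local_clock_pulses(arrivals)

data_driven_edges = sum(pulses)           # pulses issued by the stretched clock
free_running_edges = len(arrivals)        # pulses a free-running clock would issue

print(data_driven_edges, free_running_edges)
```

In this example the local clock fires in 3 of 8 slots, so clock activity tracks the incoming data rate rather than a fixed frequency.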

19/19

Challenges and Future Work

• These are early results in a much more rigorous study on the power requirements of networked on-chip communication
  – Much more soon!

• Exploiting a general-purpose on-chip network
  – Exploiting execution diversity to improve energy-efficiency
  – Multi-use platforms and Virtual-IP
  – Fault tolerance
  – Networks of processing elements or networks that process?

• Scope for removing unnecessary interfaces and boundaries

• Impact of networking on IP and processor core design

Thank You