A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low...

55
A Low - Latency Asynchronous Interconnection Network with Early Arbitration Resolution Georgios Faldamis Cavium Inc. Weiwei Jiang Dept. of Computer Science Columbia University Gennette Gill D.E. Shaw Research Steven M. Nowick Dept. of Computer Science Columbia University ACM/IEEE Asia and South Pacific Design Automation Conf. (ASP - DAC 14)

Transcript of A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low...

Page 1: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

A Low-Latency Asynchronous Interconnection Network

with Early Arbitration Resolution

Georgios Faldamis

Cavium Inc.

Weiwei JiangDept. of Computer Science

Columbia University

Gennette Gill

D.E. Shaw Research

Steven M. NowickDept. of Computer Science

Columbia University

ACM/IEEE Asia and South Pacific Design Automation Conf. (ASP-DAC 14)

Page 2: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Motivation for Networks-on-Chip

Future of computing is multi-core • 2 to 4 cores are common, 8 to 16 widely available

e.g. Niagara 16-core, Intel 10-core Xeon, AMD 16-core Opteron

• Expected progression: hundreds or thousands of cores

• Trend towards complex systems-on-chip (SoC)

Communication complexity: new limiting factor

NoC design enables orthogonalization of concerns:

• Improves scalability- buses and crossbars unable to deliver desired bandwidth

- global ad-hoc wiring does not scale to large systems

• Provides flexibility- handle pre-scheduled and dynamic traffic

- route around faulty network nodes

• Facilitates design reuse- standard interfaces increase modularity, decrease design time

2

Page 3: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Key Active Research Challenges for NoCs

Power consumption• Will exceed future power budgets by a factor of 10x

- [Owens IEEE Micro-07]

• Global clocks: consume large fraction of overall power

• Complex clock-gating techniques- [Benini et al., TVLSI-02]

Chips partitioned into multiple timing domains

• Difficult to integrate heterogeneous modules

• Dynamic voltage/frequency scaling (DVFS) for lower power- [Ogras/Marculescu DAC-08]

A key performance bottleneck = latency• Latency critical for on-chip memory access

• Important for chip multiprocessors (CMP’s)

3

Page 4: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Potential Advantages of Asynchronous Design

Lower power• No clock power consumed• Idle components consume no dynamic power

- IBM/Columbia FIR filter [Tierno, Singh, Nowick, et al., ISSCC-02]

Greater flexibility/modularity

• Easier integration between multiple timing domains

• Supports reusable components- [Bainbridge/Furber, IEEE Micro-02 Magazine]

- [Dobkin/Ginosar, Async-04]

Lower system latency• No per-router clock synchronization no waiting for clock

- [Sheibanyrad/Greiner et al., IEEE Design & Test ‘08]

- [Horak, Nowick, et al., NOCS-10]

4

Page 5: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Motivation for Our Research

Target = interconnection network for CMP’s• Network between processors and cache memory• GALS NoC: sync/async interfaces + async network

Requires high performance• Low system-level latency

- Lightweight routers for low-latency

• High sustained throughput- Maximize steady-state throughput

Target topology = variant MoT(“Mesh-of-Trees”)

• Tree topologies becoming widely used for CMP’s:- XMT [Balkan/Vishkin et al., Hot Interconnects-07]- Single-cycle network [Rahimi, Benini, et al., DATE-11]- NOC-OUT [Grot, Falsafi, et al., IEEE Micro-12]

Our two main contributions: • High-performance async network with advance arbitration• Detailed comparative evaluation on 8 benchmarks 5

core

s

share

d c

ach

e

Page 6: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Contributions (1)

Mesh-of-Trees (MoT) network with “early arbitration”• Target system-latency bottleneck

• Observe newly-entering traffic

• Perform early arbitration + channel pre-allocation

Net benefit: bypass arbitration logic + pre-opened channel

“Early arbitration” capability in fan-in router nodes• Simple and fast operate as FIFO in many traffic scenarios

Monitoring network:• Rapid advance notification of incoming data

• Fast and lightweight

• Key component for early arbitration

6

Page 7: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Contributions (2)

Detailed experimentation and analysis

• “Early arbitration” network vs. “baseline” and “predictive”

- “baseline”: [Horak/Nowick, NOCS-10]

- “predictive”: [Gill/Nowick, NOCS-11]

• 8 diverse synthetic benchmarks

- represent different network conditions

• Significant latency improvement and comparable throughput

- New vs. baseline: 23-30% latency improvement

- New vs. predictive: 13-38% latency improvement

• Low end-to-end system latency

- ~1.7ns (at 25% load, 90nm): through 6 router nodes + 5 hops

7

Page 8: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Related Work: NoC Acceleration Techniques

Express virtual channels [Kumar/Peh, ISCA-07]

• Selective packets use dedicated fast channels

• Virtually bypass intermediate nodesimprovements only against slow coarse-grained baseline: 3-cycle operation

SMART NoC [Chen/Peh, DATE-13]

• Selective packets traverse multiple hops in one cycle

requires advanced circuit-level techniques + aggressive timing assumptions

Hybrid network [Modarressi/Arjomand, DATE-09]

• A normal packet-switched network + fast circuit-switched network

• Flits can switch between two sub-networks

requires partitioned network (statically-allocated) + large circuit-switched setup time

NoC using “advanced bundles” [Kumar et al., ICCD-07]

• Provides advanced information of flit arrival

• Closer to our approach

“advance bundles” advance only one cycle per hop (unlike our approach) 8

Page 9: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

• Introduction• Background• New Asynchronous MoT Network Overview of the “Early Arbitration” Approach

Monitoring Network

Design of the New Arbitration Node

• Experimental Results

Simulation Setup

Network-Level Results

• Conclusion and Future Work

Outline

9

Page 10: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Background: Mesh-of-Trees (MoT) Variant

Topology basics• Fan-out and fan-in network

“inverse” of classical MoT (Leighton)

• Two node typesRouting: 1 input and 2 output channels

Arbitration: 2 input and 1 output channels

Routing features• Deterministic wormhole routing

Path examples shown in the figure

• No contention between distinct source/sink pairs

Potential performance benefits• Lower latency and higher throughput over 2D-mesh

• Shown to perform well for CMP’s[Balkan/Vishkin, Trans. VLSI, Oct. 09], [Balkan/Vishkin, Hot Interconnects-07]

0

1

2

3

0

1

2

3

10

Page 11: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Background: Two Node Types

Routing primitive• 1 input channel and 2 output handshaking channels

• Route the input to one of the outputs

Arbitration primitive• 2 input and 1 output handshaking channels

• Merge two input streams into one output stream

Routing Primitive

Req0Ack0

Data0

Req1Ack1

Data1

Data

ReqAck

Boolean

Source Routing

1 incoming handshaking

channel2 outgoing

handshaking channels

Arbitration PrimitiveReq1

Ack1

Data1

Req0Ack0

Data0

Data

ReqAck

2 incoming handshaking

channels

1 outgoing handshaking

channel

11

Page 12: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Background: Asynchronous Protocols

Handshaking: transition signaling (two-phase)• Two events per transaction

- Req/Ack toggle

• Merits over level signaling (four-phase): - 1 roundtrip communication per data item

- High throughput and low power

• Challenge of two-phase signaling: - designing lightweight implementations

Data encoding: single-rail bundled data • Standard synchronous single-rail data + extra “bundling” req

• Merits of single-rail bundled data:- low power and very good coding efficiency

- allow to re-use synchronous components

• Challenge: requires matched delay for “bundling req”- one-sided timing constraint: “request” must arrive after data is stable

Sender

Rece

iver

req

ack

First communication

Second communication

req

ack

12

Page 13: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

• Introduction• Background• New Asynchronous MoT Network Overview of the “Early Arbitration” Approach

Monitoring Network

Design of the New Arbitration Node

• Experimental Results

Simulation Setup

Network-Level Results

• Conclusion and Future Work

Outline

13

Page 14: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Overview: Early Arbitration Strategy

Key network bottleneck• System-latency

- bottleneck of arbitration logic in fan-in nodes

Basic strategy = anticipation• Observe newly-entering traffic

• Do early arbitration + channel pre-allocation

Net benefit: bypass arbitration logic

Proposed network• As soon as flit enters network:

- all downstream nodes quickly notified (by a monitoring network)

- fan-in nodes: initiate early arbitration + channel pre-allocation

• When flit arrives at each fan-in node:

- quickly sent out through pre-allocated channel

0

1

2

3

0

1

2

3

New arbitration nodes

Routing nodes (unchanged)

14

Page 15: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

• Introduction• Background• New Asynchronous MoT Network Overview of the “Early Arbitration” Approach

Monitoring Network

Design of the New Arbitration Node

• Experimental Results

Simulation Setup

Network-Level Results

• Conclusion and Future Work

Outline

15

Page 16: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Monitoring Network: Overview

Purpose: rapid advance notification of incoming data

Structure: lightweight shadow replica of MoT network

• Small monitoring control unit attached to each node - i.e. both routing and arbitration

Fast and lightweight

• Implemented by several gates for each control unit

Different role for fan-out and fan-in monitoring• Fan-out: fast forward early notification without using it

• Fan-in: fast forward and use it for early arbitration

16

Page 17: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Monitoring Network: Structure

Structure: a shadow replica of MoT network

• Small and fast monitoring control unit attached for each node

fan-out

root

fan-in

root

Monitoring

Channel

Monitoring

Channel

Monitoring

Channels

Monitoring

Channels

Monitoring

Channels

Monitoring

Channel

Monitoring

Control

Monitoring

Control

Monitoring

Control

Monitoring

Control

Monitoring

Control

Monitoring

Control

Monitoring

Control

Monitoring

Control

Monitoring

Control

Monitoring

Control

Monitoring

Control

Monitoring

Control

Monitoring

Control

Monitoring

Control

Monitoring control attached to each node

x

17

Page 18: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Monitoring Network: Operation

When a flit enters the network• Early notification generated and fast forwarded

fan-out

root

fan-in

root

Monitoring

Channel

Monitoring

Channel

Monitoring

Channels

Monitoring

Channels

Monitoring

Channels

Monitoring

Channel

Monitoring

Control

Monitoring

Control

Monitoring

Control

Monitoring

Control

Monitoring

Control

Monitoring

Control

Monitoring

Control

Monitoring

Control

Monitoring

Control

Monitoring

Control

Monitoring

Control

Monitoring

Control

Monitoring

Control

Monitoring

Control

Early notification generated at fan-out root

Early notification traces same path as flits

Fan-in nodes pre-allocates the channel

18

Page 19: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

• Introduction• Background• New Asynchronous MoT Network Overview of the “Early Arbitration” Approach

Monitoring Network

Design of the New Arbitration Node

• Experimental Results

Simulation Setup

Network-Level Results

• Conclusion and Future Work

Outline

19

Page 20: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

New Arbitration Node: Circuit-Level

Mutex

Mutex Input Control 0

Mutex Input Control 1

mutex-req0

mutex-req1

onewins output-en

Req-Latch

Control

ED QL1 E

D QL2

EDQ

L4

zerowins

EDQ

L3

RS

Q

0

1

mux_select

something-

coming-in-0something-

coming-in-1

Monitor

Control

takeoverpreackout0

preackout1

E

D Q

REG

something-

coming-out

ackin

reqout

dataout

reqin0

reqin1

datain0

datain1

ackout1

ackout0

20

Page 21: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

New Arbitration Node: Interfaces

Mutex

Mutex Input Control 0

Mutex Input Control 1

mutex-req0

mutex-req1

onewins output-en

Req-Latch

Control

ED QL1 E

D QL2

EDQ

L4

zerowins

EDQ

L3

RS

Q

0

1

mux_select

something-

coming-in-0something-

coming-in-1

Monitor

Control

takeoverpreackout0

preackout1

E

D Q

REG

something-

coming-out

ackin

reqout

dataout

reqin0

reqin1

datain0

datain1

ackout1

ackout0

2 input data channels

1 output data channel

21

Page 22: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

New Arbitration Node: Interfaces (cont.)

Mutex

Mutex Input Control 0

Mutex Input Control 1

mutex-req0

mutex-req1

onewins output-en

Req-Latch

Control

ED QL1 E

D QL2

EDQ

L4

zerowins

EDQ

L3

RS

Q

0

1

mux_select

something-

coming-in-0something-

coming-in-1

Monitor

Control

takeoverpreackout0

preackout1

E

D Q

REG

something-

coming-out

ackin

reqout

dataout

reqin0

reqin1

datain0

datain1

ackout1

ackout0

Monitoring channels: provide advance info. on incoming traffic

22

Page 23: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

New Arbitration Node: Structure

Mutex

Mutex Input Control 0

Mutex Input Control 1

mutex-req0

mutex-req1

onewins output-en

Req-Latch

Control

ED QL1 E

D QL2

EDQ

L4

zerowins

EDQ

L3

RS

Q

0

1

mux_select

something-

coming-in-0something-

coming-in-1

Monitor

Control

takeoverpreackout0

preackout1

E

D Q

REG

something-

coming-out

ackin

reqout

dataout

reqin0

reqin1

datain0

datain1

ackout1

ackout0

Mutex: resolves arbitration between 2 input channels

23

Page 24: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

New Arbitration Node: Structure (cont.)

Mutex

Mutex Input Control 0

Mutex Input Control 1

mutex-req0

mutex-req1

onewins output-en

Req-Latch

Control

ED QL1 E

D QL2

EDQ

L4

zerowins

EDQ

L3

RS

Q

0

1

mux_select

something-

coming-in-0something-

coming-in-1

Monitor

Control

takeoverpreackout0

preackout1

E

D Q

REG

something-

coming-out

ackin

reqout

dataout

reqin0

reqin1

datain0

datain1

ackout1

ackout0

Mutex Input Control: requests/releases Mutex

Key component to enable early arbitration

24

Page 25: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

New Arbitration Node: Structure (cont.)

Mutex

Mutex Input Control 0

Mutex Input Control 1

mutex-req0

mutex-req1

onewins output-en

Req-Latch

Control

ED QL1 E

D QL2

EDQ

L4

zerowins

EDQ

L3

RS

Q

0

1

mux_select

something-

coming-in-0something-

coming-in-1

Monitor

Control

takeoverpreackout0

preackout1

E

D Q

REG

something-

coming-out

ackin

reqout

dataout

reqin0

reqin1

datain0

datain1

ackout1

ackout0

Input channel latch + control:

Two functions: (i) enables channel pre-allocation, (ii) flow control

25

Page 26: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

New Arbitration Node: Structure (cont.)

Mutex

Mutex Input Control 0

Mutex Input Control 1

mutex-req0

mutex-req1

onewins output-en

Req-Latch

Control

ED QL1 E

D QL2

EDQ

L4

zerowins

EDQ

L3

RS

Q

0

1

mux_select

something-

coming-in-0something-

coming-in-1

Monitor

Control

takeoverpreackout0

preackout1

E

D Q

REG

something-

coming-out

ackin

reqout

dataout

reqin0

reqin1

datain0

datain1

ackout1

ackout0

Monitoring control: fast forwards early notification

26

Page 27: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

New Arbitration Node: Key Feature (1)

Mutex

Mutex Input Control 0

Mutex Input Control 1

mutex-req0

mutex-req1

onewins output-en

Req-Latch

Control

ED QL1 E

D QL2

EDQ

L4

zerowins

EDQ

L3

RS

Q

0

1

mux_select

something-

coming-in-0something-

coming-in-1

Monitor

Control

takeoverpreackout0

preackout1

E

D Q

REG

something-

coming-out

ackin

reqout

dataout

reqin0

reqin1

datain0

datain1

ackout1

ackout0

Early arbitration capability:

Monitoring signals initiate arbitration, before actual flit arrival

27

Page 28: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

New Arbitration Node: Key Feature (2)

Mutex

Mutex Input Control 0

Mutex Input Control 1

mutex-req0

mutex-req1

onewins output-en

Req-Latch

Control

ED QL1 E

D QL2

EDQ

L4

zerowins

EDQ

L3

RS

Q

0

1

mux_select

something-

coming-in-0something-

coming-in-1

Monitor

Control

takeoverpreackout0

preackout1

E

D Q

REG

something-

coming-out

ackin

reqout

dataout

reqin0

reqin1

datain0

datain1

ackout1

ackout0

Highly optimized forward path:

contains only 1 pre-opened latch = FIFO stage

Forward path

Latch pre-opened by early arbitration

28

Page 29: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Simulation: Overview

Two simulations

#1. Single-flit scenario

- friendly case

- illustrate how early arbitration works

#2. Contention between two input channels

- more advanced and adversarial case

- illustrate how to resolve contention

29

Page 30: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Simulation #1: Single-Flit

Mutex

Mutex Input Control 0

Mutex Input Control 1

mutex-req0

mutex-req1

onewins output-en

Req-Latch

Control

ED QL1 E

D QL2

EDQ

L4

zerowins

EDQ

L3

RS

Q

0

1

mux_select

something-

coming-in-0something-

coming-in-1

Monitor

Control

takeoverpreackout0

preackout1

E

D Q

REG

something-

coming-out

ackin

reqout

dataout

reqin0

reqin1

datain0

datain1

ackout1

ackout0

Step #1: Monitoring signal arrives (well before actual flit)

Quickly forwarded

Initiates early arbitration

30

Page 31: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Simulation #1: Single-Flit (cont.)

Mutex

Mutex Input Control 0

Mutex Input Control 1

mutex-req0

mutex-req1

onewins output-en

Req-Latch

Control

ED QL1 E

D QL2

EDQ

L4

zerowins

EDQ

L3

RS

Q

0

1

mux_select

something-

coming-in-0something-

coming-in-1

Monitor

Control

takeoverpreackout0

preackout1

E

D Q

REG

something-

coming-out

ackin

reqout

dataout

reqin0

reqin1

datain0

datain1

ackout1

ackout0

Step #2: Completes early arbitration

Wins arbitration

Opens channel

31

Page 32: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Simulation #1: Single-Flit (cont.)

Mutex Input Control 0

Mutex Input Control 1

mutex-req0

mutex-req1

onewins output-en

Req-Latch

Control

ED QL1 E

D QL2

EDQ

L4

zerowins

EDQ

L3

RS

Q

0

1

mux_select

something-

coming-in-0something-

coming-in-1

Monitor

Control

takeoverpreackout0

preackout1

E

D Q

REG

something-

coming-out

ackin

reqout

dataout

reqin0

reqin1

datain0

datain1

ackout1

ackout0

Mutex

Step #3: Flit arrives and gets through pre-allocated channel

Channel already opened

Flit arrives

Flit sent out

32

Page 33: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Forward Latency: Single-Flit

Mutex Input Control 0

Mutex Input Control 1

mutex-req0

mutex-req1

onewins output-en

Req-Latch

Control

ED QL1 E

D QL2

EDQ

L4

zerowins

EDQ

L3

RS

Q

0

1

mux_select

something-

coming-in-0something-

coming-in-1

Monitor

Control

takeoverpreackout0

preackout1

E

D Q

REG

something-

coming-out

ackin

reqout

dataout

reqin0

reqin1

datain0

datain1

ackout1

ackout0

Mutex

Forward latency = D-latch + XOR2 gate

Channel already opened

Flit arrives

Flit sent out

33

Page 34: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Simulation #2: Contention

Mutex

Mutex Input Control 0

Mutex Input Control 1

mutex-req0

mutex-req1

onewins output-en

Req-Latch

Control

ED QL1 E

D QL2

EDQ

L4

zerowins

EDQ

L3

RS

Q

0

1

mux_select

something-

coming-in-0something-

coming-in-1

Monitor

Control

takeoverpreackout0

preackout1

E

D Q

REG

something-

coming-out

ackin

reqout

dataout

reqin0

reqin1

datain0

datain1

ackout1

ackout0

Both monitoring signals arrive almost simultaneously

34

Page 35: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Simulation #2: Contention (cont.)

Mutex

Mutex Input Control 0

Mutex Input Control 1

mutex-req0

mutex-req1

onewins output-en

Req-Latch

Control

ED QL1 E

D QL2

EDQ

L4

zerowins

EDQ

L3

RS

Q

0

1

mux_select

something-

coming-in-0something-

coming-in-1

Monitor

Control

takeoverpreackout0

preackout1

E

D Q

REG

something-

coming-out

ackin

reqout

dataout

reqin0

reqin1

datain0

datain1

ackout1

ackout0

Both monitoring signals request mutexAssume channel #0 wins arbitration

Channel #0 wins mutex

Channel #0 Opens

35

Page 36: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Simulation #2: Contention (cont.)

Mutex Input Control 0

Mutex Input Control 1

mutex-req0

mutex-req1

onewins output-en

Req-Latch

Control

ED QL1 E

D QL2

EDQ

L4

zerowins

EDQ

L3

RS

Q

0

1

mux_select

something-

coming-in-0something-

coming-in-1

Monitor

Control

takeoverpreackout0

preackout1

E

D Q

REG

something-

coming-out

ackin

reqout

dataout

reqin0

reqin1

datain0

datain1

ackout1

ackout0

Mutex

Flit on channel #0 arrives and goes through pre-allocated channel

Flit on channel #1 arrives but is blocked

Channel #0 already opened

Both flits arriveChannel #0 flit sent out

Channel #1 is blocked

36

Page 37: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Simulation #2: Contention (cont.)

Mutex

Mutex Input Control 0

Mutex Input Control 1

mutex-req0

mutex-req1

onewins output-en

Req-Latch

Control

ED QL1 E

D QL2

EDQ

L4

zerowins

EDQ

L3

RS

Q

0

1

mux_select

something-

coming-in-0something-

coming-in-1

Monitor

Control

takeoverpreackout0

preackout1

E

D Q

REG

something-

coming-out

ackin

reqout

dataout

reqin0

reqin1

datain0

datain1

ackout1

ackout0

Channel #0 finally releases mutex channel #1 wins

Channel #1 wins mutex

Channel #1 finally opens

37

Page 38: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Simulation #2: Contention (cont.)

Mutex Input Control 0

Mutex Input Control 1

mutex-req0

mutex-req1

onewins output-en

Req-Latch

Control

ED QL1 E

D QL2

EDQ

L4

zerowins

EDQ

L3

RS

Q

0

1

mux_select

something-

coming-in-0something-

coming-in-1

Monitor

Control

takeoverpreackout0

preackout1

E

D Q

REG

something-

coming-out

ackin

reqout

dataout

reqin0

reqin1

datain0

datain1

ackout1

ackout0

Mutex

Flit on channel #1 gets through

Channel #1 flit sent out

Channel #1 is now opened

38

Page 39: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

New Arbitration Node: Multi-Flit Design

Mutex

Mutex Input Control 0

Mutex Input Control 1

mutex-req0

mutex-req1

onewins output-en

Req-Latch

Control

ED QL1 E

D Q

L2

EDQ

L4

zerowins

EDQ

L3

RS

Q

0

1

mux_select

something-

coming-in-0something-

coming-in-1

Monitor

Control

takeoverpreackout0

preackout1

E

D Q

REG

something-

coming-out

ackin

reqout

dataout

reqin0

reqin1

datain0

datain1

ackout1

ackout0

__

__

end

1

__

__

end

0

Structure largely the same as single-flit design

Different Mutex Input Control: receives “tail flag”

Flit type info:“tail flag”

39

Page 40: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

• Introduction• Background• New Asynchronous MoT Network Overview of the “Early Arbitration” Approach

Monitoring Network

Design of the New Arbitration Node

• Experimental Results

Simulation Setup

Network-Level Results

• Conclusion and Future Work

Outline

40

Page 41: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Experimental Results: Overview

Two levels of evaluation:• Node-level: new arbitration node in isolation• Network-level: 8×8 network with new node

Node-level evaluation: see paper for details

• New arbitration node vs. two previous designs:- Baseline [Horak/Nowick NOCS-10]

- Predictive [Gill/Nowick NOCS-11]

• 90nm ARM standard cells, gate-level SPICE simulation

Network-level evaluation: our focus

• Three 8×8 MoT networks: each has 112 router nodes

- Baseline, Predictive, New

• Modeled in structural technology-mapped Verilog- more accurate model than in [Gill/Nowick NOCS-11]

• 8 synthetic benchmarks: a wide range of traffic patterns41

Page 42: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Benchmarks

8 diverse benchmarks• The same as those in NOCS-11

• Represent different network conditions

Classification

• Three friendly benchmarks:

- (1) Shuffle, (2) Tornado and (7) Single Source broadcast [Dally`03]

- No contention

• Three moderately adversarial benchmarks:

- (4) Simple alternation with overlap

- (5) Random restricted broadcast with partial overlap

- (8) Partial streaming with random interruption

- No contention for some nodes, light or moderate contention for others

• Two most adversarial benchmarks:

- (3) All-to-all random and (6) Hotspot8

- Heavy contention at some nodes 42

Page 43: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Network-Level Latency: Single-Flit Design

Baseline Predictive New

Latency Comparison for 25% Network Load

Moderate to significant improvement over all benchmarks• New vs. baseline: 23-30% improvement

• New vs. predictive: 13-38% improvement

43

Page 44: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Network-Level Latency: Single-Flit Design

Baseline Predictive New

Latency Comparison for 25% Network Load

Perform well for benchmark #3 and #6 (adversarial cases)• Predictive: even worse than baseline (~20% higher latency)

• New: better than baseline (~25% lower latency)

44

Page 45: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Network-Level Latency: Single-Flit Design

Baseline Predictive New

1700

Latency Comparison for 25% Network Load

Excellent latency stability: provides predictable behavior• Network latency = ~1700ps, across all benchmarks

through 6 router nodes + 5 hops

• Important for memory access in CMP’s

45

Page 46: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Network-Level Throughput: Single-Flit Design

Baseline Predictive New

Saturation Throughput

New vs. baseline: improvement up to 17% on 6 benchmarks

New vs. predictive: comparable throughput over all benchmarks

46

Page 47: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Network-Level Results: Multi-Flit Design

#1 baseline

#1 new

#3 new

#3 baseline

Latency vs. Input Rate

Fixed packet length = 3 flits/packet

Results only for benchmark #1 and #3

For both benchmarks: • ~30% latency and ~14% throughput improvement

47

Page 48: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Conclusion

Introduced a MoT network using “early arbitration”

• Address system-latency bottleneck

• Observe newly entering traffic

- via lightweight shadow monitoring network

• Perform early arbitration + channel pre-allocation

Detailed experimentation and analysis

• Significant improvements in system-latency

- New vs. baseline: 23-30% across all benchmarks

- New vs. predictive: up to 38%

48

Page 49: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Future Work

Narrow channel reservation window

• Decrease time between “channel reservation” and “flit arrival”

• Increase network utility

Target different topology

• Extend “early arbitration” to 2D-mesh, Clos network, etc.

Build a complete GALS system

• Add mixed-timing interface connect cores by the network

More experiments

• Real traffic benchmarks

49

Page 50: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Back-up Slides

50

Page 51: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Strategy Comparison: Overview

Three network designs

• Baseline

- [Horak/Nowick, NOCS-10]

- foundation of the research

• Predictive

- [Gill/Nowick, NOCS-11]

- a more recent design

• New

- the proposed design

51

Page 52: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Baseline Arbitration Node: Operation

Input channel 0

Input channel 1Output channela

rb

Flit arrives

Waiting for arbitration to complete

Input channel 0

Input channel 1Output channela

rb

Arbitration resolves

Flit sent out

Step #1:

Step #2:

52

Page 53: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

Predictive Arbitration Node: Operation

Dataoutarb

Waiting for arbitration to complete

Flit sent out

Datain0

Datain1

Something-coming-0

Something-coming-1

Flit arrives

Biased channel: held openFlits sent out without waiting

Dataout

Flit sent out

Datain0

Datain1

Something-coming-0

Something-coming-1

Flit arrives

Non-biased channel: entirely blocked

Default Mode: similar operation as “baseline” design

Biased Mode: optimized for one input channel (by prediction)53

Page 54: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

New Arbitration Node: Operation

Dataoutarb

Advance notification completes arbitration and opens the channel well before actual flit arrival

Datain0

Datain1

Something-coming-0

Something-coming-1

Monitoring arrives

Dataoutarb

Flits sent out without waiting

Flit sent out

Datain0

Datain1

Something-coming-0

Something-coming-1

Flit arrives

Arbitration already done

Step #1:

Step #2:

54

Page 55: A Low-Latency Asynchronous Interconnection Network with ...Requires high performance •Low system-level latency - Lightweight routers for low-latency •High sustained throughput

The Role of Monitoring Network

Predictive design [Gill/Nowick, NOCS-11]

• Facilitates mode change- from optimized (biased) to unoptimized (default) only

• For safety purpose only- plays secondary role

New design• Key component of early arbitration strategy

- directly initiates early arbitration

• For higher performance- especially system-latency

55