Download - An Asynchronous Routing Algorithm for Clos Networks

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

An Asynchronous Routing Algorithm for Clos Networks

Wei Song and Doug Edwards

The Advanced Processor Technologies Group (APT)

School of Computer Science

The University of Manchester

{songw, doug}@cs.man.ac.uk

1

Outline

• Introduction of Clos networks and asynchronous circuits

• Classic Synchronous dispatching algorithms – Random dispatching (RD)

– Concurrent Round-Robin Dispatching (CRRD)

• Asynchronous Dispatching (AD) algorithm

• Implementation

• Outcome

– 32-port Clos network. Setup 6.2 ns, release 3.9 ns.



2

Switching Networks

• Telephone networks

• Asynchronous transfer mode (ATM) and

IP networks

–ATLANTA chip (622Mb/s/port, 1997)

–Petastar Optical Switch (160Gb/s/port, 2003)

• Intra/inter chip interconnect

–High-radix router with many narrow ports

–SDM: delay guaranteed services



3

Asynchronous Circuit

• Asynchronous vs. synchronous

–Low power (clockless)

–Possible performance boost (average latency)

–Tolerance to process variation

• Implementation style

–2-phase vs. 4-phase

–Bundled-data vs. quasi-delay insensitive (QDI)

–4-phase QDI



4

Research Target

• An area-efficient and high-speed switching

network for on-chip networks

–High-radix router

–Low-power consumption

–Tolerance to process variation

–Area efficient



5

Clos Switching Network



6

3 stages: IMs, CMs and OMs

C(n, k, m):k IM/OMsn ports per IM/OMm CMs

LI links between IM/CMLO links between CM/OM

SMS: buffer CMMSM: buffer IM/OMS3: no buffer

n´m

n´m

n´m

k´k

k´k

k´k

mń

mń

mń

IM(0)

IM(k-1)

IM(i)

CM(0)

CM(m-1)

CM(r)

OM(0)

OM(k-1)

OM(j)

Non-Blocking

• Blocking

– An available input/output pairs may not be connected due to internal blocking.

• Strict Non-Blocking (SNB)– All available input/output pairs are connectable.

–

• Rearrangeable Non-Blocking (RNB)– Connecting available input/output pairs may

require internal permutation.

–



7

12 nmn

12 nm

Area Consumption



8

Routing Algorithms

• Optimal algorithms

–Algorithms provide guaranteed results for all matches but with a higher complexity in time and implementation.

• Heuristic algorithms

–Algorithms provide all or partial connections in much lower time complexity.

• Most dynamically reconfigurable Clos

networks use heuristic algorithms



9

Dispatching Algorithms

Every Input/output pair has mpaths. Packets are dispatched evenly to all CMs.

Dispatching algorithm: the algorithm used in IM to CM packet dispatching.



10

IM(0)

CM(0)

IM(1)

OM(0)

OM(1)

CM(1)

CM(2)

C(2,3,2)

Random Dispatching (RD)

• Phase 1

– IPs send

requests to

LIs

– LIs select IPs

• Phase 2

– CMs select

LIs

– Configure

granted IPs



11

IM(0)

CM(0)

IM(1)

OM(0)

OM(1)

CM(1)

CM(2)

0

2

1

3


• Phase 1

– IPs send

requests to

LIs

– LIs select IPs

• Phase 2

– CMs select

LIs

– Configure

granted IPs



12

IM(0)

CM(0)

IM(1)

OM(0)

OM(1)

CM(1)

CM(2)

0

2

1

3


• Phase 1

– IPs send

requests to

LIs

– LIs select IPs

• Phase 2

– CMs select

LIs

– Configure

granted IPs



13

IM(0)

CM(0)

IM(1)

OM(0)

OM(1)

CM(1)

CM(2)

0

2

1

3

RD cannot resolve contention in IMs.

Concurrent Round-Robin Dispatching (CRRD)

• Phase 1

– IPs request

– LIs select

– IPs select

– Go back

(iteration)

• Phase 2

– Same as RD



14

IM(0)

CM(0)

IM(1)

OM(0)

OM(1)

CM(1)

CM(2)

0

2

1

3


• Phase 1

– IPs request

– LIs select

– IPs select

– Go back

(iteration)

• Phase 2

– Same as RD



15

IM(0)

CM(0)

IM(1)

OM(0)

OM(1)

CM(1)

CM(2)

0

2

1

3


• Phase 1

– IPs request

– LIs select

– IPs select

– Go back

(iteration)

• Phase 2

– Same as RD



16

IM(0)

CM(0)

IM(1)

OM(0)

OM(1)

CM(1)

CM(2)

0

2

1

3


• Phase 1

– IPs request

– LIs select

– IPs select

– Go back

(iteration)

• Phase 2

– Same as RD



17

IM(0)

CM(0)

IM(1)

OM(0)

OM(1)

CM(1)

CM(2)

0

2

1

3


• Phase 1

– IPs request

– LIs select

– IPs select

– Go back

(iteration)

• Phase 2

– Same as RD



18

IM(0)

CM(0)

IM(1)

OM(0)

OM(1)

CM(1)

CM(2)

0

2

1

3


• Phase 1

– IPs request

– LIs select

– IPs select

– Go back

(iteration)

• Phase 2

– Same as RD



19

IM(0)

CM(0)

IM(1)

OM(0)

OM(1)

CM(1)

CM(2)

0

2

1

3

CRRD cannot resolve contention in CMs.

Asynchronous Dispatching (AD)

• Difficulties

– Packets arrive asynchronously

– Modules are event-driven

– Orphans are generated by failed requests

• Solution

– Independent algorithms (allocators) run in IMs and CMs

– Failed requests use status feedback from CMs as acknowledge.



20


• IM alg.

– IPs request

– LIs select

– OK? Send data

– Fail? Go back

• CM alg.

– CMs grant LIs

– Update status

feedback



21

IM(0)

CM(0)

IM(1)

OM(0)

OM(1)

CM(1)

CM(2)

0

2

1

3


• IM alg.

– IPs request

– LIs select

– OK? Send data

– Fail? Go back

• CM alg.

– CMs grant LIs

– Update status

feedback



22

IM(0)

CM(0)

IM(1)

OM(0)

OM(1)

CM(1)

CM(2)

0

2

1

3


• IM alg.

– IPs request

– LIs select

– OK? Send data

– Fail? Go back

• CM alg.

– CMs grant LIs

– Update status

feedback



23

IM(0)

CM(0)

IM(1)

OM(0)

OM(1)

CM(1)

CM(2)

0

2

1

3

Comparison of Algorithms

• Random Dispatching (AD)

– Cannot resolve contention in IMs or CMs.

• Concurrent Round-Robin Dispatching (CRRD)– Handle contention in IMs using iterations.

– Cannot resolve contention in CMs.

• Asynchronous Dispatching (AD)– Contention in IMs is handled by asynchronous

arbiter naturally.

– Handle contention in CMs using status feedback.



24

Scheduler Architecture



25

OM_Sch

(0)

IM_Sch(i)

IM_Sch(7)

CM_Sch(r)

CM_Sch(3)

OM_Sch

(7)

OM_Sch

(j)

RIM

IM_Dis

InpG0

InpG1

InpG2

InpG3

OMr

IMr

OM

r

IMr

RCM

CM_Dis

CM_Sch(0)IM_Sch(0)

req(0,0)

req(0,1)

req(0,2)

req(0,3)

req(i,0)

req(i,1)

req(i,2)

req(i,3)

req(7,0)

req(7,1)

req(7,2)

req(7,3)

C(4,8,4)

CM Dispatcher/OM Scheduler

CMa: CM ack

CMs: CM status feedback

CMa and CMs are

mutually exclusive



26

Tre

e A

rbite

r

CMr( i, j)

CMr( 0, j)

CMr( 7, j)

CMa( i, j)

CMa( 0, j)

CMa( 7, j)

CMa( i', j)

i' = 0 ~ 7, i' ≠ i

CMa( i, j')

j' = 0 ~ 7, j' ≠ j

CMs( i, j)

C(4,8,4)

OM_Sch

(0)

IM_Sch(i)

IM_Sch(7)

CM_Sch(r)

CM_Sch(3)

OM_Sch

(7)

OM_Sch

(j)

RIM

IM_Dis

InpG0

InpG1

InpG2

InpG3

OMr

IMr

OM

r

IMr

RCM

CM_Dis

CM_Sch(0)IM_Sch(0)

req(0,0)

req(0,1)

req(0,2)

req(0,3)

req(i,0)

req(i,1)

req(i,2)

req(i,3)

req(7,0)

req(7,1)

req(7,2)

req(7,3)

IM Dispatcher



27

IM/CM

Match

ArbiterIM/CM

Match

Arbiter

LI Arbiter

LI Arbiter

IMcfg

IMr(0)

IMr(1)

IMr(2)

IMr(3)

ImCmA(7)

ImCmA(0)

CMs(3)

CMs(0)

CMr(3)

CMr(0)

IMa

8

4*44

8

8

CfgG

CMa(0)

CMa(3)8

C(4,8,4)OM_Sch

(0)

IM_Sch(i)

IM_Sch(7)

CM_Sch(r)

CM_Sch(3)

OM_Sch

(7)

OM_Sch

(j)

RIM

IM_Dis

InpG0

InpG1

InpG2

InpG3

OMr

IMr

OM

r

IMr

RCM

CM_Dis

CM_Sch(0)IM_Sch(0)

req(0,0)

req(0,1)

req(0,2)

req(0,3)

req(i,0)

req(i,1)

req(i,2)

req(i,3)

req(7,0)

req(7,1)

req(7,2)

req(7,3)

IM/CM Match Arbiter(Multi-resource Arbiter)

• S. Golubcovs, D. Shang, F. Xia, A. Mokhov, and A. Yakovlev, “Modular approach to

multi-resource arbiter design,” ASYNC 2009.



28

Tree Arbiter

Tre

e A

rbite

r

ri cb

ocb

o

ri

rbo

rbi

rbi

rbocbi

cb

o

riri cb

o

rbi

rbocbi

rbo

rbi

cbi

cbi

ci

ci ci

ci

IMr(0, j)

IMr(3, j)

IMrb(0, j)

IMrb(3, j)

CMs(0, j)

CMsb(0, j)

CMs(3, j)

CMsb(3, j)

h

h( j,3

,0)

h

h( j,0

,0)

hh( j,0,3)

hh( j,3,3)

C(4,8,4)

Request Generation



29

h( j, h, r)

IMr( h, j)

CMs( r, j)

CMrMx( r, j, h)

CMrMx( r, j, 0)

CMrMx( r, j, 3)

CMrME( r, j)

LI A

rbite

rCMrME( r, 0)

CMrME( r, 7)

CMr( r, 0)

CMr( r, j)

CMr( r, 7)

CMsb( r, j)

CMrMx(0, j, h)

CMrMx(3, j, h)

IMrb( h, j)

C(4,8,4)

Configuration Generation



30

CMa( r, j)

CMrMx( r, j, h)

CMr( r, j)

cfgMx( r, h, j)

cfgMx( r, h, 0)

cfgMx( r, h, 7)

IMcfg( r, h)

IMa(h)

IMcfg(0, h)

IMcfg(3, h)

C(4,8,4)

Area and Speed



31

Dispatching:setup 4.6 ns release 2.9 ns

OM Scheduler:setup 1.6 ns release 1.0 ns

Total: setup 6.2 ns release 3.9 ns

C(4,8,4)

MATLAB Simulation Model



32

Throughput Analysis



33

C(4,8,4) non-blocking loadRD 55%CRRD (i=4) 70%AD (i=3) 76%

C(4,8,m) non-blocking loadRD (m=7) 74%CRRD (m=7) 82%AD (m=7) 96%

Conclusions

• The first asynchronous routing algorithm for general 3-stage Clos networks.

• Better throughput performance than RD and CRRD

• Future works

–Optimize the Clos structure

–Reduce area and latency



34

Thank you!

Questions?



35

Iterations



36

Uniform Traffic



37

Results after Optimization

• Async

– Setup 3.7 ns

– Reset 3.2 ns

– Period ~ 7 ns (140M)

– 28K Gates

• Sync

– 350 MHz

– Cell time = Iter+1 =5 (70M)

– 22K Gates

• Optimization

– Replace multi-resource arbiter

– Eager request (InpG)

– MUTEX arbiter

– Optimized tree arbiter



38