June 23rd 2010, Advanced Processor Technologies Group
The School of Computer Science
An Asynchronous Routing Algorithm for Clos Networks
Wei Song and Doug Edwards
The Advanced Processor Technologies Group (APT)
School of Computer Science
The University of Manchester
{songw, doug}@cs.man.ac.uk
Outline
• Introduction of Clos networks and asynchronous circuits
• Classic synchronous dispatching algorithms
– Random Dispatching (RD)
– Concurrent Round-Robin Dispatching (CRRD)
• Asynchronous Dispatching (AD) algorithm
• Implementation
• Outcome
– 32-port Clos network. Setup 6.2 ns, release 3.9 ns.
Switching Networks
• Telephone networks
• Asynchronous transfer mode (ATM) and IP networks
– ATLANTA chip (622 Mb/s/port, 1997)
– Petastar Optical Switch (160 Gb/s/port, 2003)
• Intra/inter-chip interconnect
– High-radix router with many narrow ports
– SDM: delay-guaranteed services
Asynchronous Circuit
• Asynchronous vs. synchronous
–Low power (clockless)
–Possible performance boost (average latency)
–Tolerance to process variation
• Implementation style
–2-phase vs. 4-phase
–Bundled-data vs. quasi-delay insensitive (QDI)
–4-phase QDI
Research Target
• An area-efficient and high-speed switching
network for on-chip networks
–High-radix router
–Low-power consumption
–Tolerance to process variation
–Area efficient
Clos Switching Network
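• 3 stages: input modules (IMs), central modules (CMs) and output modules (OMs)
• C(n, k, m): k IMs/OMs, n ports per IM/OM, m CMs
• LI: links between IMs and CMs; LO: links between CMs and OMs
• Module sizes: IM(i) is n×m, CM(r) is k×k, OM(j) is m×n
• Buffering schemes: SMS buffers the CMs, MSM buffers the IMs/OMs, S3 has no buffers
[Figure: three-stage Clos network with IM(0) … IM(k-1), CM(0) … CM(m-1) and OM(0) … OM(k-1)]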
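The C(n, k, m) parameters can be turned into port, link and crosspoint counts. The following is a minimal sketch, assuming one LI link between every IM/CM pair and using the crosspoint count as a rough area measure; the class name `Clos` and the choice of crosspoints as the cost metric are illustrative assumptions, not taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clos:
    """Three-stage Clos network C(n, k, m) (hypothetical helper class)."""
    n: int  # ports per IM/OM
    k: int  # number of IMs (and of OMs)
    m: int  # number of CMs

    @property
    def ports(self) -> int:
        # k input modules with n ports each
        return self.n * self.k

    @property
    def li_links(self) -> int:
        # assumption: one LI link from every IM to every CM
        return self.k * self.m

    @property
    def crosspoints(self) -> int:
        # k IMs of n x m, m CMs of k x k, k OMs of m x n
        return 2 * self.k * self.n * self.m + self.m * self.k * self.k


net = Clos(n=4, k=8, m=4)
print(net.ports, net.li_links, net.crosspoints)  # 32 32 512
```

For comparison, a single 32×32 crossbar would need 32 × 32 = 1024 crosspoints under the same measure.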
Non-Blocking
• Blocking
– An available input/output pair may not be connectable due to internal blocking.
• Strict Non-Blocking (SNB)
– All available input/output pairs are connectable.
– m ≥ 2n − 1
• Rearrangeable Non-Blocking (RNB)
– Connecting available input/output pairs may require internal permutation.
– 2n − 1 > m ≥ n
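The non-blocking classes follow the classic Clos conditions: strictly non-blocking when m ≥ 2n − 1, rearrangeably non-blocking when m ≥ n. A minimal sketch of that classification (the function name `blocking_class` is a hypothetical helper):

```python
def blocking_class(n: int, m: int) -> str:
    """Classify C(n, k, m) by its non-blocking property (k does not matter)."""
    if m >= 2 * n - 1:
        return "SNB"       # strictly non-blocking
    if m >= n:
        return "RNB"       # rearrangeably non-blocking
    return "blocking"


print(blocking_class(4, 4))  # RNB
print(blocking_class(4, 7))  # SNB
```

With n = 4, a network with m = 4 is rearrangeable only, and m = 7 is the smallest number of CMs that makes it strictly non-blocking.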
Area Consumption
Routing Algorithms
• Optimal algorithms
– Guarantee a result for every match, at the cost of higher time and implementation complexity.
• Heuristic algorithms
– Provide all or only some of the connections, with much lower time complexity.
• Most dynamically reconfigurable Clos networks use heuristic algorithms.
Dispatching Algorithms
Every input/output pair has m paths, one through each CM. Packets are dispatched evenly to all CMs.
Dispatching algorithm: the algorithm an IM uses to dispatch packets to the CMs.
[Figure: example Clos network, labelled C(2,3,2), with IM(0), IM(1), CM(0), CM(1), CM(2), OM(0) and OM(1)]
Random Dispatching (RD)
• Phase 1
– IPs send requests to LIs
– LIs select IPs
• Phase 2
– CMs select LIs
– Configure granted IPs
RD cannot resolve contention in IMs.
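The two RD phases can be sketched as a single round of random selections. This is a simplified reading of the slide, not the authors' implementation: each requesting IP is assumed to pick one LI at random, only the destination OM of a request is modelled, and the function name `random_dispatch` and its data layout are hypothetical.

```python
import random


def random_dispatch(requests, m, rng=None):
    """One RD round.

    requests: dict mapping (im, ip) -> destination OM index.
    m: number of CMs (one LI per IM/CM pair is assumed).
    Returns a dict mapping each granted (im, ip) -> CM index.
    """
    rng = rng or random.Random()

    # Phase 1: every requesting IP picks one LI (one CM) at random.
    li_requests = {}  # (im, cm) -> list of requesting IPs
    for (im, ip) in requests:
        cm = rng.randrange(m)
        li_requests.setdefault((im, cm), []).append(ip)

    # Phase 1: every LI selects one of the IPs that asked for it.
    # Losing IPs are not retried, so IM contention stays unresolved.
    li_winner = {li: rng.choice(ips) for li, ips in li_requests.items()}

    # Phase 2: every CM output link towards an OM selects one requesting LI.
    lo_requests = {}  # (cm, om) -> list of (im, ip)
    for (im, cm), ip in li_winner.items():
        om = requests[(im, ip)]
        lo_requests.setdefault((cm, om), []).append((im, ip))

    # Phase 2: configure the granted IPs.
    granted = {}
    for (cm, om), candidates in lo_requests.items():
        granted[rng.choice(candidates)] = cm
    return granted


# Four IPs in two IMs, three CMs; which requests win depends on the seed.
reqs = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 1}
print(random_dispatch(reqs, m=3, rng=random.Random(1)))
```

Two IPs of the same IM that happen to pick the same LI collide even when other LIs are idle, which is the IM contention noted above.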
Concurrent Round-Robin Dispatching (CRRD)
• Phase 1
– IPs request
– LIs select
– IPs select
– Go back (iteration)
• Phase 2
– Same as RD
CRRD cannot resolve contention in CMs.
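Phase 1 of CRRD is an iterative request/select/select matching inside each IM. The sketch below covers one IM only and is a simplification under stated assumptions: every unmatched IP requests every free LI, and round-robin pointers advance only for matches made in the first iteration (an iSLIP-style rule assumed here, not confirmed by the slides). The names `rr_pick` and `crrd_phase1` are hypothetical.

```python
def rr_pick(candidates, pointer, size):
    """Return the first candidate at or after `pointer`, wrapping modulo `size`."""
    return min(candidates, key=lambda c: (c - pointer) % size)


def crrd_phase1(requesting_ips, n, m, li_ptr, ip_ptr, iterations):
    """Match the requesting IPs of one IM to its m LIs.

    li_ptr[l] / ip_ptr[p]: round-robin pointers of LI l and IP p (mutated).
    Returns a dict mapping matched IP -> LI.
    """
    match = {}
    for it in range(iterations):
        free_ips = [p for p in requesting_ips if p not in match]
        free_lis = [l for l in range(m) if l not in match.values()]
        if not free_ips or not free_lis:
            break
        # LIs select: every free LI picks one unmatched requesting IP.
        offers = {}  # ip -> list of LIs that selected it
        for l in free_lis:
            p = rr_pick(free_ips, li_ptr[l], n)
            offers.setdefault(p, []).append(l)
        # IPs select: every selected IP keeps exactly one LI.
        for p, lis in offers.items():
            l = rr_pick(lis, ip_ptr[p], m)
            match[p] = l
            if it == 0:  # assumed pointer-update rule
                li_ptr[l] = (p + 1) % n
                ip_ptr[p] = (l + 1) % m
    return match


# One IM with n = 2 IPs and m = 3 LIs, all pointers starting at 0.
print(crrd_phase1([0, 1], 2, 3, [0, 0, 0], [0, 0], iterations=1))  # {0: 0}
print(crrd_phase1([0, 1], 2, 3, [0, 0, 0], [0, 0], iterations=2))  # {0: 0, 1: 1}
```

With a single iteration all three LIs select IP 0 and IP 1 is left out; the second iteration matches it. Phase 2 still lets each CM pick only one LI per output link, so contention in the CMs remains.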
Asynchronous Dispatching (AD)
• Difficulties
– Packets arrive asynchronously
– Modules are event-driven
– Orphans are generated by failed requests
• Solution
– Independent algorithms (allocators) run in IMs and CMs
– Failed requests use the status feedback from the CMs as their acknowledgement.
Asynchronous Dispatching (AD)
• IM alg.
– IPs request
– LIs select
– OK? Send data
– Fail? Go back
• CM alg.
– CMs grant LIs
– Update status feedback
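The IM and CM steps above can be sketched as an event-driven allocator that handles one request at a time, in arrival order. This is a software model under stated assumptions, not the circuit: the real design uses concurrent asynchronous arbiters, whereas this sketch tries LIs in a fixed order, and the class `AsyncDispatcher` with its `setup`/`release` methods is hypothetical.

```python
class AsyncDispatcher:
    """Sequential model of AD for a Clos network with m CMs."""

    def __init__(self, m):
        self.m = m
        self.li_busy = set()  # (im, cm) links currently carrying a packet
        self.lo_busy = set()  # (cm, om) links in use: the CM status feedback

    def setup(self, im, om):
        """Handle one request from an IP in `im` to `om`.

        Returns the CM that was granted, or None if every path is busy.
        """
        for cm in range(self.m):            # LI selection (fixed order assumed)
            if (im, cm) in self.li_busy:
                continue                    # this LI is already taken
            if (cm, om) in self.lo_busy:
                continue                    # Fail: feedback says busy, go back
            self.li_busy.add((im, cm))      # OK: CM grants the LI, send data
            self.lo_busy.add((cm, om))      # CM updates its status feedback
            return cm
        return None

    def release(self, im, cm, om):
        """Tear down a path once the packet has passed."""
        self.li_busy.discard((im, cm))
        self.lo_busy.discard((cm, om))


d = AsyncDispatcher(m=3)
print(d.setup(0, 0))  # 0
print(d.setup(1, 0))  # 1: CM(0) reports its link to OM(0) as busy
```

Because a request that fails at one CM goes back and tries another LI, contention in the CMs is resolved without a global time slot.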
Comparison of Algorithms
• Random Dispatching (RD)
– Cannot resolve contention in IMs or CMs.
• Concurrent Round-Robin Dispatching (CRRD)
– Handles contention in IMs using iterations.
– Cannot resolve contention in CMs.
• Asynchronous Dispatching (AD)
– Contention in IMs is handled naturally by asynchronous arbiters.
– Handles contention in CMs using status feedback.
Scheduler Architecture
[Figure: scheduler architecture for C(4,8,4). Request inputs req(i, 0) … req(i, 3) for i = 0 … 7; schedulers IM_Sch(0) … IM_Sch(7), CM_Sch(0) … CM_Sch(3) and OM_Sch(0) … OM_Sch(7); blocks InpG0 … InpG3, IM_Dis, RIM, CM_Dis and RCM; signals IMr and OMr]
CM Dispatcher/OM Scheduler
• CMa: CM ack
• CMs: CM status feedback
• CMa and CMs are mutually exclusive
[Figure: tree arbiter of the CM dispatcher/OM scheduler for C(4,8,4). Requests CMr(0, j) … CMr(7, j); acks CMa(0, j) … CMa(7, j); status feedback CMs(i, j), drawn together with CMa(i', j) for i' = 0 … 7, i' ≠ i, and CMa(i, j') for j' = 0 … 7, j' ≠ j]
IM Dispatcher
[Figure: IM dispatcher for C(4,8,4). Blocks: IM/CM match arbiters, LI arbiters, CfgG. Signals: IMr(0) … IMr(3), IMa, IMcfg, ImCmA(0) … ImCmA(7), CMr(0) … CMr(3), CMa(0) … CMa(3), CMs(0) … CMs(3)]
IM/CM Match Arbiter (Multi-resource Arbiter)
• S. Golubcovs, D. Shang, F. Xia, A. Mokhov, and A. Yakovlev, “Modular approach to multi-resource arbiter design,” ASYNC 2009.
[Figure: IM/CM match arbiter for C(4,8,4), including a tree arbiter. Signals: IMr(0, j) … IMr(3, j), IMrb(0, j) … IMrb(3, j), CMs(0, j) … CMs(3, j), CMsb(0, j) … CMsb(3, j), h(j, 0, 0) … h(j, 3, 3); arbiter-cell ports ri, rbi, rbo, ci, cbi, cbo]
Request Generation
[Figure: request generation for C(4,8,4). Block: LI arbiter. Signals: h(j, h, r), IMr(h, j), IMrb(h, j), CMs(r, j), CMsb(r, j), CMrMx(r, j, 0) … CMrMx(r, j, 3), CMrMx(0, j, h) … CMrMx(3, j, h), CMrME(r, 0) … CMrME(r, 7), CMr(r, 0) … CMr(r, 7)]
Configuration Generation
[Figure: configuration generation for C(4,8,4). Signals: CMa(r, j), CMr(r, j), CMrMx(r, j, h), cfgMx(r, h, 0) … cfgMx(r, h, 7), IMcfg(0, h) … IMcfg(3, h), IMa(h)]
Area and Speed
C(4,8,4)
• Dispatching: setup 4.6 ns, release 2.9 ns
• OM scheduler: setup 1.6 ns, release 1.0 ns
• Total: setup 6.2 ns, release 3.9 ns
MATLAB Simulation Model
Throughput Analysis
Non-blocking load, C(4,8,4):
• RD 55%
• CRRD (i=4) 70%
• AD (i=3) 76%
Non-blocking load, C(4,8,m):
• RD (m=7) 74%
• CRRD (m=7) 82%
• AD (m=7) 96%
Conclusions
• The first asynchronous routing algorithm for general 3-stage Clos networks.
• Better throughput than RD and CRRD.
• Future work
–Optimize the Clos structure
–Reduce area and latency
Thank you!
Questions?
Iterations
Uniform Traffic
Results after Optimization
• Async
– Setup 3.7 ns
– Reset 3.2 ns
– Period ~ 7 ns (140M)
– 28K Gates
• Sync
– 350 MHz
– Cell time = Iter+1 =5 (70M)
– 22K Gates
• Optimization
– Replace multi-resource arbiter
– Eager request (InpG)
– MUTEX arbiter
– Optimized tree arbiter
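The bracketed figures in the Async and Sync lists can be checked with a little arithmetic. The sketch below assumes that "140M" and "70M" are rates per second (the slide gives no unit) and that the async period is the sum of setup and reset; the variable names are illustrative.

```python
# Async scheduler: one period is setup plus reset.
setup_ns, reset_ns = 3.7, 3.2
async_period_ns = setup_ns + reset_ns      # about 6.9 ns, quoted as ~7 ns
async_rate_m = 1e3 / 7.0                   # millions per second at 7 ns

# Sync scheduler: 350 MHz clock, cell time of Iter + 1 = 5 cycles.
sync_clock_mhz = 350.0
cycles_per_cell = 5
sync_rate_m = sync_clock_mhz / cycles_per_cell

print(round(async_period_ns, 1))  # 6.9
print(round(async_rate_m))        # 143, quoted as 140M
print(sync_rate_m)                # 70.0
```

Under these assumptions the asynchronous scheduler completes roughly twice as many periods per second as the synchronous one, for 28K gates against 22K.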