An Asynchronous Routing Algorithm for Clos...

38
June 23rd 2010 Advanced Processor Technologies Group The School of Computer Science An Asynchronous Routing Algorithm for Clos Networks Wei Song and Doug Edwards The Advanced Processor Technologies Group (APT) School of Computer Science The University of Manchester {songw, doug}@cs.man.ac.uk 1

Transcript of An Asynchronous Routing Algorithm for Clos...

  • June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    An Asynchronous Routing Algorithm for Clos Networks

    Wei Song and Doug Edwards

    The Advanced Processor Technologies Group (APT)

    School of Computer Science

    The University of Manchester

    {songw, doug}@cs.man.ac.uk

    1

  • Outline

    • Introduction of Clos networks and asynchronous circuits

    • Classic Synchronous dispatching algorithms – Random dispatching (RD)

    – Concurrent Round-Robin Dispatching (CRRD)

    • Asynchronous Dispatching (AD) algorithm

    • Implementation

    • Outcome

    – 32-port Clos network. Setup 6.2 ns, release 3.9 ns.

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    2

  • Switching Networks

    • Telephone networks

    • Asynchronous transfer mode (ATM) and

    IP networks

    –ATLANTA chip (622Mb/s/port, 1997)

    –Petastar Optical Switch (160Gb/s/port, 2003)

    • Intra/inter chip interconnect

    –High-radix router with many narrow ports

    –SDM: delay guaranteed services

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    3

  • Asynchronous Circuit

    • Asynchronous vs. synchronous

    –Low power (clockless)

    –Possible performance boost (average latency)

    –Tolerance to process variation

    • Implementation style

    –2-phase vs. 4-phase

    –Bundled-data vs. quasi-delay insensitive (QDI)

    –4-phase QDI

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    4

  • Research Target

    • An area-efficient and high-speed switching

    network for on-chip networks

    –High-radix router

    –Low-power consumption

    –Tolerance to process variation

    –Area efficient

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    5

  • Clos Switching Network

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    6

    3 stages: IMs, CMs and OMs

    C(n, k, m):k IM/OMsn ports per IM/OMm CMs

    LI links between IM/CMLO links between CM/OM

    SMS: buffer CMMSM: buffer IM/OMS3: no buffer

    n´m

    n´m

    n´m

    k´k

    k´k

    k´k

    m´n

    m´n

    m´n

    IM(0)

    IM(k-1)

    IM(i)

    CM(0)

    CM(m-1)

    CM(r)

    OM(0)

    OM(k-1)

    OM(j)

  • Non-Blocking

    • Blocking

    – An available input/output pairs may not be connected due to internal blocking.

    • Strict Non-Blocking (SNB)– All available input/output pairs are connectable.

    • Rearrangeable Non-Blocking (RNB)– Connecting available input/output pairs may

    require internal permutation.

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    7

    12 nmn

    12 nm

  • Area Consumption

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    8

  • Routing Algorithms

    • Optimal algorithms

    –Algorithms provide guaranteed results for all matches but with a higher complexity in time and implementation.

    • Heuristic algorithms

    –Algorithms provide all or partial connections in much lower time complexity.

    • Most dynamically reconfigurable Clos

    networks use heuristic algorithms

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    9

  • Dispatching Algorithms

    Every Input/output pair has mpaths. Packets are dispatched evenly to all CMs.

    Dispatching algorithm: the algorithm used in IM to CM packet dispatching.

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    10

    IM(0)

    CM(0)

    IM(1)

    OM(0)

    OM(1)

    CM(1)

    CM(2)

    C(2,3,2)

  • Random Dispatching (RD)

    • Phase 1

    – IPs send

    requests to

    LIs

    – LIs select IPs

    • Phase 2

    – CMs select

    LIs

    – Configure

    granted IPs

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    11

    IM(0)

    CM(0)

    IM(1)

    OM(0)

    OM(1)

    CM(1)

    CM(2)

    0

    2

    1

    3

  • Random Dispatching (RD)

    • Phase 1

    – IPs send

    requests to

    LIs

    – LIs select IPs

    • Phase 2

    – CMs select

    LIs

    – Configure

    granted IPs

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    12

    IM(0)

    CM(0)

    IM(1)

    OM(0)

    OM(1)

    CM(1)

    CM(2)

    0

    2

    1

    3

  • Random Dispatching (RD)

    • Phase 1

    – IPs send

    requests to

    LIs

    – LIs select IPs

    • Phase 2

    – CMs select

    LIs

    – Configure

    granted IPs

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    13

    IM(0)

    CM(0)

    IM(1)

    OM(0)

    OM(1)

    CM(1)

    CM(2)

    0

    2

    1

    3

    RD cannot resolve contention in IMs.

  • Concurrent Round-Robin Dispatching (CRRD)

    • Phase 1

    – IPs request

    – LIs select

    – IPs select

    – Go back

    (iteration)

    • Phase 2

    – Same as RD

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    14

    IM(0)

    CM(0)

    IM(1)

    OM(0)

    OM(1)

    CM(1)

    CM(2)

    0

    2

    1

    3

  • Concurrent Round-Robin Dispatching (CRRD)

    • Phase 1

    – IPs request

    – LIs select

    – IPs select

    – Go back

    (iteration)

    • Phase 2

    – Same as RD

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    15

    IM(0)

    CM(0)

    IM(1)

    OM(0)

    OM(1)

    CM(1)

    CM(2)

    0

    2

    1

    3

  • Concurrent Round-Robin Dispatching (CRRD)

    • Phase 1

    – IPs request

    – LIs select

    – IPs select

    – Go back

    (iteration)

    • Phase 2

    – Same as RD

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    16

    IM(0)

    CM(0)

    IM(1)

    OM(0)

    OM(1)

    CM(1)

    CM(2)

    0

    2

    1

    3

  • Concurrent Round-Robin Dispatching (CRRD)

    • Phase 1

    – IPs request

    – LIs select

    – IPs select

    – Go back

    (iteration)

    • Phase 2

    – Same as RD

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    17

    IM(0)

    CM(0)

    IM(1)

    OM(0)

    OM(1)

    CM(1)

    CM(2)

    0

    2

    1

    3

  • Concurrent Round-Robin Dispatching (CRRD)

    • Phase 1

    – IPs request

    – LIs select

    – IPs select

    – Go back

    (iteration)

    • Phase 2

    – Same as RD

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    18

    IM(0)

    CM(0)

    IM(1)

    OM(0)

    OM(1)

    CM(1)

    CM(2)

    0

    2

    1

    3

  • Concurrent Round-Robin Dispatching (CRRD)

    • Phase 1

    – IPs request

    – LIs select

    – IPs select

    – Go back

    (iteration)

    • Phase 2

    – Same as RD

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    19

    IM(0)

    CM(0)

    IM(1)

    OM(0)

    OM(1)

    CM(1)

    CM(2)

    0

    2

    1

    3

    CRRD cannot resolve contention in CMs.

  • Asynchronous Dispatching (AD)

    • Difficulties

    – Packets arrive asynchronously

    – Modules are event-driven

    – Orphans are generated by failed requests

    • Solution

    – Independent algorithms (allocators) run in IMs and CMs

    – Failed requests use status feedback from CMs as acknowledge.

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    20

  • Asynchronous Dispatching (AD)

    • IM alg.

    – IPs request

    – LIs select

    – OK? Send data

    – Fail? Go back

    • CM alg.

    – CMs grant LIs

    – Update status

    feedback

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    21

    IM(0)

    CM(0)

    IM(1)

    OM(0)

    OM(1)

    CM(1)

    CM(2)

    0

    2

    1

    3

  • Asynchronous Dispatching (AD)

    • IM alg.

    – IPs request

    – LIs select

    – OK? Send data

    – Fail? Go back

    • CM alg.

    – CMs grant LIs

    – Update status

    feedback

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    22

    IM(0)

    CM(0)

    IM(1)

    OM(0)

    OM(1)

    CM(1)

    CM(2)

    0

    2

    1

    3

  • Asynchronous Dispatching (AD)

    • IM alg.

    – IPs request

    – LIs select

    – OK? Send data

    – Fail? Go back

    • CM alg.

    – CMs grant LIs

    – Update status

    feedback

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    23

    IM(0)

    CM(0)

    IM(1)

    OM(0)

    OM(1)

    CM(1)

    CM(2)

    0

    2

    1

    3

  • Comparison of Algorithms

    • Random Dispatching (AD)

    – Cannot resolve contention in IMs or CMs.

    • Concurrent Round-Robin Dispatching (CRRD)– Handle contention in IMs using iterations.

    – Cannot resolve contention in CMs.

    • Asynchronous Dispatching (AD)– Contention in IMs is handled by asynchronous

    arbiter naturally.

    – Handle contention in CMs using status feedback.

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    24

  • Scheduler Architecture

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    25

    OM_Sch

    (0)

    IM_Sch(i)

    IM_Sch(7)

    CM_Sch(r)

    CM_Sch(3)

    OM_Sch

    (7)

    OM_Sch

    (j)

    RIM

    IM_Dis

    InpG0

    InpG1

    InpG2

    InpG3

    OMr

    IMr

    OM

    r

    IMr

    RCM

    CM_Dis

    CM_Sch(0)IM_Sch(0)

    req(0,0)

    req(0,1)

    req(0,2)

    req(0,3)

    req(i,0)

    req(i,1)

    req(i,2)

    req(i,3)

    req(7,0)

    req(7,1)

    req(7,2)

    req(7,3)

    C(4,8,4)

  • CM Dispatcher/OM Scheduler

    CMa: CM ack

    CMs: CM status feedback

    CMa and CMs are

    mutually exclusive

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    26

    Tre

    e A

    rbite

    r

    CMr( i, j)

    CMr( 0, j)

    CMr( 7, j)

    CMa( i, j)

    CMa( 0, j)

    CMa( 7, j)

    CMa( i', j)

    i' = 0 ~ 7, i' ≠ i

    CMa( i, j')

    j' = 0 ~ 7, j' ≠ j

    CMs( i, j)

    C(4,8,4)

    OM_Sch

    (0)

    IM_Sch(i)

    IM_Sch(7)

    CM_Sch(r)

    CM_Sch(3)

    OM_Sch

    (7)

    OM_Sch

    (j)

    RIM

    IM_Dis

    InpG0

    InpG1

    InpG2

    InpG3

    OMr

    IMr

    OM

    r

    IMr

    RCM

    CM_Dis

    CM_Sch(0)IM_Sch(0)

    req(0,0)

    req(0,1)

    req(0,2)

    req(0,3)

    req(i,0)

    req(i,1)

    req(i,2)

    req(i,3)

    req(7,0)

    req(7,1)

    req(7,2)

    req(7,3)

  • IM Dispatcher

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    27

    IM/CM

    Match

    ArbiterIM/CM

    Match

    Arbiter

    LI Arbiter

    LI Arbiter

    IMcfg

    IMr(0)

    IMr(1)

    IMr(2)

    IMr(3)

    ImCmA(7)

    ImCmA(0)

    CMs(3)

    CMs(0)

    CMr(3)

    CMr(0)

    IMa

    8

    4*44

    8

    8

    CfgG

    CMa(0)

    CMa(3)8

    C(4,8,4)OM_Sch(0)

    IM_Sch(i)

    IM_Sch(7)

    CM_Sch(r)

    CM_Sch(3)

    OM_Sch

    (7)

    OM_Sch

    (j)

    RIM

    IM_Dis

    InpG0

    InpG1

    InpG2

    InpG3

    OMr

    IMr

    OM

    r

    IMr

    RCM

    CM_Dis

    CM_Sch(0)IM_Sch(0)

    req(0,0)

    req(0,1)

    req(0,2)

    req(0,3)

    req(i,0)

    req(i,1)

    req(i,2)

    req(i,3)

    req(7,0)

    req(7,1)

    req(7,2)

    req(7,3)

  • IM/CM Match Arbiter(Multi-resource Arbiter)

    • S. Golubcovs, D. Shang, F. Xia, A. Mokhov, and A. Yakovlev, “Modular approach to

    multi-resource arbiter design,” ASYNC 2009.

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    28

    Tree Arbiter

    Tre

    e A

    rbite

    r

    ri cbo

    cb

    o

    ri

    rbo

    rbi

    rbi

    rbocbi

    cb

    o

    riri cb

    o

    rbi

    rbocbi

    rbo

    rbi

    cbi

    cbi

    ci

    ci ci

    ci

    IMr(0, j)

    IMr(3, j)

    IMrb(0, j)

    IMrb(3, j)

    CMs(0, j)

    CMsb(0, j)

    CMs(3, j)

    CMsb(3, j)

    h

    h( j,3

    ,0)

    h

    h( j,0

    ,0)

    hh( j,0,3)

    hh( j,3,3)

    C(4,8,4)

  • Request Generation

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    29

    h( j, h, r)

    IMr( h, j)

    CMs( r, j)

    CMrMx( r, j, h)

    CMrMx( r, j, 0)

    CMrMx( r, j, 3)

    CMrME( r, j)

    LI A

    rbite

    rCMrME( r, 0)

    CMrME( r, 7)

    CMr( r, 0)

    CMr( r, j)

    CMr( r, 7)

    CMsb( r, j)

    CMrMx(0, j, h)

    CMrMx(3, j, h)

    IMrb( h, j)

    C(4,8,4)

  • Configuration Generation

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    30

    CMa( r, j)

    CMrMx( r, j, h)

    CMr( r, j)

    cfgMx( r, h, j)

    cfgMx( r, h, 0)

    cfgMx( r, h, 7)

    IMcfg( r, h)

    IMa(h)

    IMcfg(0, h)

    IMcfg(3, h)

    C(4,8,4)

  • Area and Speed

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    31

    Dispatching:setup 4.6 ns release 2.9 ns

    OM Scheduler:setup 1.6 ns release 1.0 ns

    Total: setup 6.2 ns release 3.9 ns

    C(4,8,4)

  • MATLAB Simulation Model

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    32

  • Throughput Analysis

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    33

    C(4,8,4) non-blocking loadRD 55%CRRD (i=4) 70%AD (i=3) 76%

    C(4,8,m) non-blocking loadRD (m=7) 74%CRRD (m=7) 82%AD (m=7) 96%

  • Conclusions

    • The first asynchronous routing algorithm for general 3-stage Clos networks.

    • Better throughput performance than RD and CRRD

    • Future works

    –Optimize the Clos structure

    –Reduce area and latency

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    34

  • Thank you!

    Questions?

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    35

  • Iterations

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    36

  • Uniform Traffic

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    37

  • Results after Optimization

    • Async

    – Setup 3.7 ns

    – Reset 3.2 ns

    – Period ~ 7 ns (140M)

    – 28K Gates

    • Sync

    – 350 MHz

    – Cell time = Iter+1 =5 (70M)

    – 22K Gates

    • Optimization

    – Replace multi-resource arbiter

    – Eager request (InpG)

    – MUTEX arbiter

    – Optimized tree arbiter

    June 23rd 2010Advanced Processor Technologies Group

    The School of Computer Science

    38