Prediction Router - research.nii.ac.jp
![Page 1: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/1.jpg)
Prediction Router: Yet another low-latency on-chip router architecture
Hiroki Matsutani (Keio Univ., Japan)
Michihiro Koibuchi (NII, Japan)
Hideharu Amano (Keio Univ., Japan)
Tsutomu Yoshinaga (UEC, Japan)
![Page 2: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/2.jpg)
Why is a low-latency router needed?
• Tile architecture
– Many cores (e.g., processors & caches)
– On-chip interconnection network
[Figure: 16-core tile architecture; each tile pairs a core with a router, and the routers form a packet-switched network [Dally, DAC’01]]
The on-chip router affects the performance and cost of the chip
![Page 3: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/3.jpg)
| System | Topology | Routing | Switching | Flow ctrl |
|---|---|---|---|---|
| MIT RAW | 2D mesh (32bit) | XY DOR | WH, no VC | Credit |
| UPMC SPIN | Fat Tree (32bit) | Up*/down* | WH, no VC | Credit |
| QuickSilver ACM | H-Tree (32bit) | Up*/down* | 1-flit, no VC | Credit |
| UMass Amherst aSOC | 2D mesh | Shortest-path | Pipelined CS, no VC | Timeslot |
| Sun T1 | Crossbar (128bit) | - | - | Handshake |
| Cell BE EIB | Ring (128bit) | Shortest-path | Pipelined CS, no VC | Credit |
| TRIPS (operand) | 2D mesh (109bit) | YX DOR | 1-flit, no VC | On/off |
| TRIPS (on-chip) | 2D mesh (128bit) | YX DOR | WH, 4 VCs | Credit |
| Intel SCC | 2D torus (32bit) | XY,YX DOR, odd-even TM | WH, no VC | Stall/go |
| TILE64 iMesh | 2D mesh (32bit) | XY DOR | WH, no VC | Credit |
| Intel 80-core NoC | 2-D mesh (32bit) | Source routing | WH, 2 lanes | On/off |
Why is a low-latency router needed?
• The number of cores is increasing (e.g., 64 cores or more?)
• The number of hops increases
• Communication latency becomes a crucial problem
• Low-latency router architectures have been extensively studied
![Page 4: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/4.jpg)
Outline: Prediction router for low-latency NoC
• Existing low-latency routers
– Speculative router
– Look-ahead router
– Bypassing router
• Prediction router
– Architecture and the prediction algorithms
• Hit rate analysis
• Evaluations
– Hit rate, gate count, and energy consumption
– Case study 1: 2-D mesh (small core size)
– Case study 2: 2-D mesh (large core size)
– Case study 3: Fat tree network
![Page 5: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/5.jpg)
Wormhole router: Hardware structure
[Figure: input ports (X+, X-, Y+, Y-, CORE) with FIFOs, a 5x5 crossbar, an arbiter issuing GRANT signals, and output ports (X+, X-, Y+, Y-, CORE)]
Routing, arbitration, and switch traversal are performed in a pipelined manner:
1) selecting an output channel
2) arbitration for the selected output channel
3) sending the packet to the output channel
![Page 6: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/6.jpg)
Pipeline structure: 3-cycle router
• At least 3 cycles for traversing a router
– RC (Routing computation)
– VSA (Virtual channel & switch allocations)
– ST (Switch traversal)
• A packet transfer from router (a) to router (c)
[Timing diagram: a 4-flit packet (HEAD, DATA 1, DATA 2, DATA 3) crossing routers A, B, and C over elapsed cycles 1 to 12; at each router the head flit goes through RC, VSA, and ST, and each data flit goes through SA and ST]
At least 12 cycles for transferring a packet from router (a) to router (c)
Speculative router: VA/SA in parallel [Peh,HPCA’01]
• VA & SA are speculatively performed in parallel
• To perform RC and VSA in parallel, look-ahead routing is used
![Page 7: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/7.jpg)
Look-ahead router: RC/VA in parallel
• At least 3 cycles for traversing a router
– NRC (Next routing computation)
– VSA (Virtual channel & switch allocations)
– ST (Switch traversal)
[Timing diagram: a 4-flit packet (HEAD, DATA 1, DATA 2, DATA 3) crossing routers A, B, and C over elapsed cycles 1 to 12; the head flit goes through NRC, VSA, and ST, and each data flit goes through SA and ST]
• NRC is the routing computation for the next hop: the output port of router (i+1) is selected by router i
• VSA can be performed without waiting for NRC
![Page 8: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/8.jpg)
Look-ahead router: RC/VA in parallel
• At least 2 cycles for traversing a router
– NRC + VSA (Next routing computation / arbitrations)
– ST (Switch traversal)
• There is no dependency between NRC & VSA, so they are performed in parallel
• Typical example of a 2-cycle router [Dally’s book, 2004]
[Timing diagram: a 4-flit packet (HEAD, DATA 1, DATA 2, DATA 3) crossing routers A, B, and C over elapsed cycles 1 to 9; NRC and VSA share one cycle, followed by ST]
At least 9 cycles for transferring a packet from router (a) to router (c)
Packing NRC, VSA, and ST into a single stage would harm the operating frequency
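The 12-cycle and 9-cycle figures quoted for the 3-cycle and 2-cycle routers can be checked with a small calculation. This is only a sketch under the assumptions visible in the timing diagrams: no contention, a 4-flit packet, three routers, and each data flit following one cycle behind the previous flit. The function name `packet_latency` is illustrative and not from the talk.

```python
def packet_latency(routers: int, cycles_per_router: int, flits: int) -> int:
    """Cycles until the last flit leaves the last router, assuming no contention."""
    # The head flit pays the full router pipeline at every router on the path.
    head = routers * cycles_per_router
    # Each remaining flit trails the previous one by a single cycle.
    tail = flits - 1
    return head + tail


three_cycle = packet_latency(routers=3, cycles_per_router=3, flits=4)
two_cycle = packet_latency(routers=3, cycles_per_router=2, flits=4)
print(three_cycle, two_cycle)  # 12 9
```

The same function can be reused with `cycles_per_router=1` to see the best case that a single-cycle router would reach on this path.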
![Page 9: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/9.jpg)
Bypassing router: skip some stages
• Bypassing between intermediate nodes
– E.g., Express VCs [Kumar, ISCA’07]
[Figure: path from SRC to DST with virtual bypassing paths; routers that are not bypassed take 3 cycles each, and bypassed routers take 1 cycle each]
![Page 10: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/10.jpg)
Bypassing router: skip some stages
• Bypassing between intermediate nodes
– E.g., Express VCs [Kumar, ISCA’07]
• Pipeline bypassing utilizing the regularity of DOR
– E.g., Mad postman [Izu, PDP’94]
• Pipeline stages on frequently used paths are skipped
– E.g., Dynamic fast path [Park, HOTI’07]
• Pipeline stages on user-specific paths are skipped
– E.g., Preferred path [Michelogiannakis, NOCS’07]
– E.g., DBP [Koibuchi, NOCS’08]
[Figure: path from SRC to DST with virtual bypassing paths; routers that are not bypassed take 3 cycles each, and bypassed routers take 1 cycle each]
We propose a low-latency router based on multiple predictors
![Page 12: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/12.jpg)
Prediction router for 1-cycle transfer [Yoshinaga,IWIA’06] [Yoshinaga,IWIA’07]
• Each input channel has predictors
• When an input channel is idle,
– Predict an output port to be used (RC pre-execution)
– Arbitrate for the predicted port (SA pre-execution)
[Timing diagram: a 4-flit packet (HEAD, DATA 1, DATA 2, DATA 3) crossing routers A, B, and C over elapsed cycles 1 to 12; without prediction the head flit goes through RC, VSA, and ST at each router]
RC & VSA are skipped if the prediction hits, giving a 1-cycle transfer
E.g., we can expect a 1.6-cycle transfer if 70% of predictions hit
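The 1.6-cycle figure follows from weighting the two cases by the hit rate. A minimal sketch, assuming 1 cycle on a hit and 3 cycles on a miss as stated in the talk; the function name `expected_cycles_per_hop` is illustrative.

```python
def expected_cycles_per_hop(hit_rate: float,
                            hit_cycles: int = 1,
                            miss_cycles: int = 3) -> float:
    """Average router traversal time for a given prediction hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    # Weighted average of the hit case and the miss case.
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


print(round(expected_cycles_per_hop(0.7), 2))  # 1.6
```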
![Page 13: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/13.jpg)
Prediction router for 1-cycle transfer (continued)
[Timing diagram, build step: the prediction misses at router A, so the head flit goes through RC, VSA, and ST there]
![Page 14: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/14.jpg)
Prediction router for 1-cycle transfer (continued)
[Timing diagram, build step: the prediction misses at router A and hits at router B, where the flits need only ST]
![Page 15: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/15.jpg)
Prediction router for 1-cycle transfer (continued)
[Timing diagram, build step: the prediction misses at router A and hits at routers B and C, where the flits need only ST]
![Page 16: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/16.jpg)
Prediction router: Prediction algorithms [Yoshinaga,IWIA’06] [Yoshinaga,IWIA’07]
• An efficient predictor is key
• A single predictor isn’t enough for applications with different traffic patterns
• Prediction router
– Multiple predictors (A, B, C) for each input channel
– Select one of them in response to a given network environment
Prediction algorithms:
1. Random
2. Static Straight (SS): An output channel on the same dimension is selected (exploiting the regularity of DOR)
3. Custom: User can specify which output channel is accelerated
4. Latest Port (LP): Previously used output channel is selected
5. Finite Context Method (FCM): The most frequently appearing pattern of an n-context sequence (n = 0,1,2,…) [Burtscher, TC’02]
6. Sampled Pattern Match (SPM): Pattern matching using a record table [Jacquet, TIT’02]
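Three of the listed algorithms (SS, LP, and FCM) are simple enough to sketch in software. The code below is an illustrative model, not the hardware design from the talk: the class names, the `predict`/`update` interface, and the port-naming convention in `OPPOSITE` (a port is named after the neighbor it connects to, so going straight means leaving on the opposite-named port) are all assumptions.

```python
from collections import Counter, defaultdict, deque

# Assumption: a flit arriving on the X- port goes straight by leaving on X+.
OPPOSITE = {"X+": "X-", "X-": "X+", "Y+": "Y-", "Y-": "Y+"}


class StaticStraight:
    """SS: always predict the output on the same dimension as the input."""
    def __init__(self, input_port):
        self.input_port = input_port

    def predict(self):
        return OPPOSITE.get(self.input_port)  # None for the CORE port

    def update(self, used_port):
        pass  # static: nothing to learn


class LatestPort:
    """LP: predict the output port used by the previous packet."""
    def __init__(self):
        self.last = None

    def predict(self):
        return self.last

    def update(self, used_port):
        self.last = used_port


class FiniteContext:
    """FCM: predict the port seen most often after the last n ports."""
    def __init__(self, n=0):
        self.history = deque(maxlen=n)
        self.table = defaultdict(Counter)

    def predict(self):
        counts = self.table.get(tuple(self.history))
        return counts.most_common(1)[0][0] if counts else None

    def update(self, used_port):
        self.table[tuple(self.history)][used_port] += 1
        self.history.append(used_port)


def hit_rate(predictor, trace):
    """Fraction of packets in the trace whose output port was predicted."""
    hits = 0
    for port in trace:
        hits += predictor.predict() == port
        predictor.update(port)
    return hits / len(trace)


trace = ["X+", "X+", "Y+", "X+", "X+", "Y+"]  # made-up output-port trace
for p in (StaticStraight("X-"), LatestPort(), FiniteContext(n=0)):
    print(type(p).__name__, round(hit_rate(p, trace), 2))
```

Running the three predictors over the same trace shows why the talk argues for several predictors per input channel: which one scores best depends on the traffic pattern.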
![Page 17: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/17.jpg)
Basic operation @ Correct prediction
[Figure: router with a 5x5 crossbar, arbiter, input FIFOs, and predictors (A, B, C); ports X+, X-, Y+, Y-, and CORE; the crossbar is reserved for the predicted port]
• Idle state: Output port X+ is selected and reserved
• 1st cycle: Incoming flit is transferred to X+ without RC and VSA
• 1st cycle: RC is performed in the same cycle and confirms that the prediction is correct
• 2nd cycle: Next flit is transferred to X+ without RC and VSA
1-cycle transfer using the reserved crossbar port when the prediction hits
![Page 18: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/18.jpg)
Basic operation @ Misprediction
[Figure: router with a 5x5 crossbar, arbiter, input FIFOs, and predictors (A, B, C); ports X+, X-, Y+, Y-, and CORE; the flit sent to X+ becomes a dead flit and a KILL signal is asserted]
• Idle state: Output port X+ is selected and reserved
• 1st cycle: Incoming flit is transferred to X+ without RC and VSA
• 1st cycle: RC is performed in the same cycle and finds that the prediction is wrong (X- is correct)
• Kill signal to X+ is asserted
• 2nd/3rd cycle: Dead flit is removed; retransmission to the correct port
Even with a misprediction, a flit is transferred in 3 cycles, as in the original router
More energy is consumed for retransmission
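The hit and miss cases can be summarized as one decision per flit. The sketch below is a behavioral model only, assuming the costs stated in the talk (1 cycle on a hit; 3 cycles and one dead flit on a miss); `HopResult` and `traverse` are hypothetical names.

```python
from typing import NamedTuple

HIT_CYCLES = 1   # flit uses the reserved crossbar port directly
MISS_CYCLES = 3  # dead flit is killed, then the flit is resent as in the original router


class HopResult(NamedTuple):
    output_port: str  # port the flit finally leaves on
    cycles: int       # cycles spent in this router
    dead_flits: int   # flits that had to be killed downstream


def traverse(reserved_port: str, correct_port: str) -> HopResult:
    """Model one router traversal given the reserved and the correct port."""
    if reserved_port == correct_port:
        return HopResult(correct_port, HIT_CYCLES, 0)
    # Misprediction: the speculatively sent flit is dead and is removed by KILL.
    return HopResult(correct_port, MISS_CYCLES, 1)


print(traverse("X+", "X+"))  # HopResult(output_port='X+', cycles=1, dead_flits=0)
print(traverse("X+", "X-"))  # HopResult(output_port='X-', cycles=3, dead_flits=1)
```

Counting `dead_flits` separately makes the energy cost of mispredictions visible, which latency alone does not show.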
![Page 20: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/20.jpg)
Prediction hit rate analysis
• Formulas to calculate the prediction hit rates on
– 2-D torus (Random, LP, SS, FCM, and SPM)
– 2-D mesh (Random, LP, SS, FCM, and SPM)
– Fat tree (Random and LRU)
• Purpose: to forecast which prediction algorithm is suited for a given network environment without simulations
• Accuracy of the analytical model is confirmed through simulations
Derivation of the formulas is omitted in this talk (see Section 4 of our paper for more detail)
![Page 22: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/22.jpg)
Evaluation items
• Hit rate / Comm. latency (how many cycles?)
• Area (gate count)
• Energy cons. [pJ / bit]
[Figure: evaluation flow showing flit-level network simulation; Design Compiler (synthesis) with the Fujitsu 65nm library; Astro (place & route); NC-Verilog (simulation); Power Compiler (SAIF, SDF)]

Table 1: Router & network parameters

| Parameter | Value |
|---|---|
| Packet length | 4-flit (1-flit: 64 bit) |
| Switching technique | wormhole |
| Channel buffer size | 4-flit / VC |
| Number of VCs | 1 or 2 VCs |
| Cycle / hop (miss) | 3 stage |
| Cycle / hop (hit) | 1 stage |

*Topology and traffic are mentioned later

Table 2: Process library

| Parameter | Value |
|---|---|
| CMOS process | 65nm |
| Core voltage | 1.20V |
| Temperature | 25C |

Table 3: CAD tools used

| Tool | Version |
|---|---|
| Design compiler | 2006.06 |
| Astro | 2007.03 |
![Page 23: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/23.jpg)
3 case studies of prediction router
• Case study 1 & 2: 2-D mesh network
– The most popular network topology: MIT’s RAW [Taylor,ISCA’04], Intel’s 80-core [Vangal,ISSCC’07]
– Dimension-order routing (XY routing)
• Case study 3: Fat tree network
Here, we show the results of case studies 1 and 2 together
![Page 24: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/24.jpg)
Case study 1: Zero-load comm. latency
• Uniform random traffic on 4x4 to 16x16 meshes
• Compared: original router, pred router (SS), and pred router (100% hit)
[Figure: communication latency [cycles] versus network size (k-ary 2-mesh); simulation results (the analytical model also shows the same result)]
(*) 1-cycle transfer for correct prediction, 3-cycle for wrong prediction
• 35.8% reduced for 8x8 cores
• 48.2% reduced for 16x16 cores
More latency is reduced as the network size increases (48% for k=16)
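Why the reduction grows with network size can be seen from a back-of-the-envelope model. The sketch below is not the analytical model from the paper and is not expected to reproduce the 35.8% or 48.2% figures: it assumes uniform random source/destination pairs (including identical ones), ignores link and injection delays, counts one router more than the hop count, and treats the hit rate as a free parameter. All function names are illustrative.

```python
def mean_hops(k: int) -> float:
    """Average hop count between uniformly random nodes of a k x k mesh."""
    # Brute-force average distance along one dimension, doubled for two dimensions.
    per_dim = sum(abs(i - j) for i in range(k) for j in range(k)) / (k * k)
    return 2 * per_dim


def zero_load_latency(k: int, cycles_per_router: float, flits: int = 4) -> float:
    """Head-flit router delay on an average path plus the trailing flits."""
    routers = mean_hops(k) + 1  # a path of h hops crosses h + 1 routers
    return routers * cycles_per_router + (flits - 1)


def latency_reduction(k: int, hit_rate: float) -> float:
    """Relative latency saved versus a 3-cycle router at the given hit rate."""
    per_router = hit_rate * 1 + (1 - hit_rate) * 3
    return 1 - zero_load_latency(k, per_router) / zero_load_latency(k, 3)


for k in (4, 8, 16):
    print(k, round(latency_reduction(k, hit_rate=1.0), 3))
```

Because the fixed cost of the trailing flits is amortized over more routers on longer paths, the printed reduction rises with k, which matches the trend stated on the slide.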
![Page 25: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/25.jpg)
Case study 2: Hit rate @ 8x8 mesh
[Figure: prediction hit rate [%] for 7 NAS parallel benchmark programs and 4 synthesized traffic patterns]
• SS: go straight (efficient for long straight communication)
• LP: the last one
• FCM: frequently used pattern
![Page 26: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/26.jpg)
Case study 2: Hit rate @ 8x8 mesh
[Figure: prediction hit rate [%] for 7 NAS parallel benchmark programs and 4 synthesized traffic patterns]
• SS: go straight (efficient for long straight communication)
• LP: the last one (efficient for short repeated communication)
• FCM: frequently used pattern
![Page 27: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/27.jpg)
Case study 2: Hit rate @ 8x8 mesh
[Figure: prediction hit rate [%] for 7 NAS parallel benchmark programs and 4 synthesized traffic patterns]
• SS: go straight (efficient for long straight communication)
• LP: the last one (efficient for short repeated communication)
• FCM: frequently used pattern (all-rounder)
• Existing bypassing routers use
– Only a static or a single bypassing policy
However, the effective bypassing policy depends on traffic patterns…
• Prediction router supports
– Multiple predictors which can be switched in a cycle
– To accelerate a wider range of applications
![Page 28: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/28.jpg)
Case study 2: Area & Energy
• Area (gate count)
– Original router
– Pred router (SS + LP)
– Pred router (SS+LP+FCM)
• Energy consumption
[Figure: router area [kilo gates]; Verilog-HDL designs synthesized with a 65nm library]
Area increased by 6.4 - 15.9%, depending on the type and number of predictors
Light-weight (small overhead)
FCM is an all-rounder, but requires counters
![Page 29: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/29.jpg)
Case study 2: Area & Energy
• Area (gate count)
– Original router
– Pred router (SS + LP)
– Pred router (SS+LP+FCM)
• Energy consumption
– Original router
– Pred router (70% hit)
– Pred router (100% hit)
[Figures: router area [kilo gates]; flit switching energy [pJ / bit]]
Area increased by 6.4 - 15.9%, depending on the type and number of predictors
Mispredictions consume power; energy increased by 9.5% if the hit rate is 70%
This estimation is pessimistic:
1. More energy is consumed in links, so the effect of the router energy overhead is reduced
2. The application will finish earlier, so more energy is saved
Latency is reduced by 35.8%-48.2% with reasonable area/energy overheads
![Page 31: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/31.jpg)
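Case study 3: Fat tree network
[Figure: fat tree with upward (Up) and downward (Down) transfers]
1. LRU algorithm: the least recently used (LRU) output port is selected for upward transfer
2. LRU + LP algorithm: in addition, LP is used for downward transfer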
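The LRU choice among upward ports can be sketched with a small ordered list. This is an illustrative software model, assuming the predictor tracks only the router's upward ports; the class name `LruUpPredictor` and the port labels are hypothetical.

```python
class LruUpPredictor:
    """Predict the least recently used upward output port."""

    def __init__(self, up_ports):
        # Front of the list is the least recently used port.
        self.order = list(up_ports)

    def predict(self):
        return self.order[0]

    def update(self, used_port):
        # Move the port that was just used to the most-recently-used end.
        if used_port in self.order:
            self.order.remove(used_port)
            self.order.append(used_port)


pred = LruUpPredictor(["UP0", "UP1"])
print(pred.predict())  # UP0
pred.update("UP0")
print(pred.predict())  # UP1
```

For the LRU + LP variant, a latest-port predictor would be consulted instead whenever the packet is heading downward.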
![Page 32: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/32.jpg)
Case study 3: Fat tree network
[Figure: fat tree with upward (Up) and downward (Down) transfers]
1. LRU algorithm: the least recently used (LRU) output port is selected for upward transfer
2. LRU + LP algorithm: in addition, LP is used for downward transfer
• Comm. latency @ uniform traffic
– Original router
– Pred router (LRU)
– Pred router (LRU + LP)
[Figure: communication latency [cycles] versus network size (# of cores)]
Latency reduced by 30.7% @ 256-core; small area overhead (7.8%)
![Page 33: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/33.jpg)
Summary of the prediction router
• Prediction router for low-latency NoCs
– Multiple predictors, which can be switched in a cycle
– Architecture and six prediction algorithms
– Analytical model of prediction hit rates
• Evaluations of prediction router
– Case study 1: 2-D mesh (small core size)
– Case study 2: 2-D mesh (large core size)
– Case study 3: Fat tree network
• Results (from three case studies)
1. Prediction router can be applied to various NoCs
2. Communication latency is reduced with small overheads
3. Prediction router with multiple predictors can accelerate a wider range of applications
• From case studies 1 & 2
– Area overhead: 6.4% (SS+LP)
– Energy overhead: 9.5% (worst)
– Latency reduction: up to 48%
![Page 34: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/34.jpg)
Thank you
for your attention
It would be very helpful if you would speak slowly. Thank you in advance.
![Page 35: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/35.jpg)
Prediction router: New modifications
[Figure: router with a 5x5 crossbar, arbiter, input FIFOs, predictors (A, B, C), and KILL signals; ports X+, X-, Y+, Y-, and CORE]
• Predictors for each input channel
• Kill mechanism to remove dead flits
• Two-level arbiter
– “Reservation” has higher priority
– “Tentative reservation” by the pre-execution of VSA
Currently, the critical path is related to the arbiter
![Page 36: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/36.jpg)
Prediction router: Predictor selection
• Static scheme
– A predictor is selected by the user per application
– Configuration table: Application 1 uses Predictor B; Application 2 uses Predictor A; Application 3 uses Predictor C; …
– Simple, but pre-analysis is needed
• Dynamic scheme
– A predictor is adaptively selected
– Count up if each predictor hits (e.g., Predictor A: 100, Predictor B: 80, Predictor C: 120)
– A predictor is selected every n cycles (e.g., n = 10,000)
– Flexible, but consumes more energy
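The dynamic scheme amounts to per-predictor hit counters and a periodic re-selection. The sketch below is one possible reading: the class name `DynamicSelector`, the choice to reset the counters after each selection, and the tie-breaking rule (first predictor with the highest count) are assumptions not stated in the talk.

```python
class DynamicSelector:
    """Pick the predictor with the most hits once every `period` cycles."""

    def __init__(self, names, period=10_000):
        self.hits = {name: 0 for name in names}
        self.period = period
        self.cycle = 0
        self.active = names[0]  # assumption: start with the first predictor

    def record_hit(self, name, count=1):
        # Called whenever a predictor's guess matched the actual port.
        self.hits[name] += count

    def tick(self):
        """Advance one cycle; re-select at the end of each period."""
        self.cycle += 1
        if self.cycle % self.period == 0:
            self.active = max(self.hits, key=self.hits.get)
            # Assumption: counters restart so that each period is judged alone.
            self.hits = dict.fromkeys(self.hits, 0)


sel = DynamicSelector(["A", "B", "C"], period=10_000)
sel.record_hit("A", 100)
sel.record_hit("B", 80)
sel.record_hit("C", 120)
for _ in range(10_000):
    sel.tick()
print(sel.active)  # C
```

With the counter values shown on the slide (100, 80, 120), Predictor C would be chosen for the next period.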
![Page 37: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/37.jpg)
Case study 1: Router critical path
[Figure: stage delay [FO4s] of the RC (routing computation), VSA (arbitration), and ST (switch traversal) stages for the original router and the pred router (SS)]
• Critical path delay increased by 6.2% compared with the original router
• ST can occur in these stages of the prediction router
![Page 38: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/38.jpg)
Case study 2: Hit rate @ 8x8 mesh
[Figure: prediction hit rate [%] for 7 NAS parallel benchmark programs and 4 synthesized traffic patterns]
• SS: go straight (efficient for long straight communication)
• LP: the last one (efficient for short repeated communication)
• FCM: frequently used pattern (all-rounder)
• Custom: user-specific path (efficient for simple communication)
![Page 40: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/40.jpg)
Case study 4: Spidergon network
• Spidergon topology [Coppola,ISSOC’04]
– Ring + across links
– Each router has 3 ports
– Mesh-like 2-D layout
– Across-first routing
• Hit rate @ uniform traffic
– SS: Go straight
– LP: Last used one
– FCM: Frequently used one
[Figure: prediction hit rate [%] versus network size (# of cores)]
Hit rates of SS & FCM are almost the same
A high hit rate is achieved (80% for 64 cores; 94% for 256 cores)
![Page 41: Prediction Router - research.nii.ac.jp](https://reader035.fdocuments.us/reader035/viewer/2022062222/62a435d24edf2c6dbe58fd47/html5/thumbnails/41.jpg)
4 case studies of prediction router
• Case study 1 & 2: 2-D mesh network
• Case study 3: Fat tree network
• Case study 4: Spidergon network