Traffic Steering Between a Low-Latency Unswitched TL Ring and a High-Throughput Switched On-chip Interconnect
Jungju Oh, Alenka Zajic, Milos Prvulovic
2/23
Contents
• Introduction
• Hybrid Network
  – Low-Latency Transmission Line Ring
  – Traffic Steering
• Evaluation
• Result
• Conclusion
3/23
Introduction
• On-chip communication latency is increasing
• Broadcast interconnect
  – Insufficient bandwidth and excessive delay for many-core chips
  – Growing core counts → contention
  – Growing core counts → longer wire → larger wire capacitance → longer delay
  – Unfavorable wire delay with technology scaling
• Packet-switched on-chip network (OCN)
  + Short links → fast communication between adjacent nodes
  + Scalable aggregate bandwidth
  – Packets travel many links and pipelined routers
  – Growing core counts → increasing hop counts/latency for far-apart cores
[Figure: wire delay for 1 mm (ns) vs. technology node (nm); source: ITRS 2012]
4/23
Motivation
• Switched on-chip network
  – Good latency for local traffic, but not for long-distance traffic
  – Much more local than long-distance traffic
• Broadcast interconnect
  – Avoids routing latency even for long-distance traffic
  – Cannot handle much traffic
[Figure: latency and share of traffic (%) vs. distance (hops)]
5/23
Hybrid Network
• Exploit the strengths
  – Broadcast on Transmission Line (TL): low latency
  – Switched on-chip network: throughput
• … and alleviate the weaknesses
  – Limited TL throughput: use the TL only for critical and/or long-distance traffic
  – High switching overhead for long-distance traffic: use the TL
• Two critical components of this work
  – Transmission Line Broadcast Interconnect: the why and the how
  – Traffic Steering: which messages use which interconnect
6/23
Transmission Line
• Why Transmission Line?
  – Extremely fast propagation
    • Uses electromagnetic waves for signal propagation
  – 0.0075 ns/mm (unrepeated wire: 0.54 ns/mm)
  – Not affected by technology scaling
  – But expensive in terms of metal area (20 µm wide vs. 0.135 µm for a global wire)
    • Limited throughput
[Figure: cross-sections of a transmission line (with ground) and a traditional global wire; labeled TL dimensions: 4.193 µm, 4.571 µm, 8.457 µm, 4.1 µm, 16 µm; traditional global wire: 0.135 µm]
7/23
Transmission Line Ring
• Transmission Line
  – Extremely fast propagation
  – But expensive in terms of metal area
• Why Ring?
  – Minimizes overall TL cost
  – Allows fast arbitration (token passing)
8/23
Unidirectional Transmission Line Ring
• Two major problems with TL, caused by the many connections in a many-core chip
  – Attenuation of signal (power split at connections)
  – Signal reflections/reverberations (discontinuity at connections)
  – Signal needs to stay stronger than the sum of noise and reverberations!
• Unidirectional Transmission Line (UTL) ring makes the design easy
  – Chained directional couplers in a ring shape
  – Control of attenuation
  – Almost no reflected signal
• Directional Coupler
  – Two TL lines running in parallel
9/23
Unidirectional Transmission Line Ring
• Directional Coupler
  – Two TL lines running in parallel
  – Signal into one end ①
    • Most comes out on the other end ②
    • But some is transferred (EM-coupled) in the same direction on the other line ③
  – Directivity: (almost) no signal on ④
  – Chain couplers using one line; use the other to connect transmitters/receivers
[Figure: directional coupler with ports ①–④ along the transmission line, connecting Core 1 (Rx1/Tx1) and Core 2 (Rx2/Tx2)]
10/23
Using the UTL Ring
• Simple receiver/transmitter
  – Simple modulation: on-off keying
  – 1 bit = one or more consecutive pulses
• How fast can we transfer?
  – Depends on available spectrum of the transmission medium
  – UTL coupler: 20–60 GHz
  – 40 GHz clock, 2 pulses/bit → 20 Gbps
• Transmitter
  – PLL (pulses)
  – Pass-gate (on/off pulses)
  – Amplifier (impedance matching)
• Receiver
  – Pulse detector
  – Shift register (collects high-rate bits)
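The data-rate arithmetic and the on-off keying scheme above can be illustrated with a minimal Python sketch. The function names are hypothetical, and the decoding rule (a bit is 1 if any pulse is detected in its window) is an assumption made for illustration; the slides do not specify how the receiver decides.

```python
def ook_data_rate_gbps(clock_ghz, pulses_per_bit):
    """Data rate when each bit occupies `pulses_per_bit` pulse slots."""
    if pulses_per_bit < 1:
        raise ValueError("pulses_per_bit must be at least 1")
    return clock_ghz / pulses_per_bit


def ook_encode(bits, pulses_per_bit=2):
    """On-off keying: a 1 bit sends consecutive pulses, a 0 bit sends none."""
    slots = []
    for bit in bits:
        slots.extend([1 if bit else 0] * pulses_per_bit)
    return slots


def ook_decode(slots, pulses_per_bit=2):
    """Assumed rule: a bit is 1 if any pulse was detected in its window."""
    bits = []
    for start in range(0, len(slots), pulses_per_bit):
        window = slots[start:start + pulses_per_bit]
        bits.append(1 if any(window) else 0)
    return bits


# 40 GHz clock with 2 pulses per bit, as on the slide
rate = ook_data_rate_gbps(40, 2)
```

With the slide's numbers, `rate` is 20 (Gbps). Using more pulses per bit trades data rate for easier detection.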
[Figure: block diagrams of the transmitter (PLL, Data, Amp) and the receiver (Detector, Shift register, Data)]
11/23
Traffic Steering
• Which packet should use which network?
• Static steering
  – E.g., packets traveling more than 8 hops go to the TL; the rest go on the mesh
  – Lacks adaptivity
    • When traffic is low, 8-hop, 7-hop, etc. packets could also benefit from the ring
    • When traffic is high, the ring can become saturated
[Figure: latency vs. distance for the OCN and the TL]
12/23
Adaptive Steering
• Ring-Affinity Score
  – More hops → more benefit from using the ring
  – Non-critical packet → no benefit
  – Ring-Affinity Score = latency difference plus criticality adjustment
• Threshold
  – Packets with a score above the threshold use the ring
  – Adjust the threshold to prevent ring bandwidth saturation
    • Too much traffic on the ring → queuing delays → all benefit disappears
13/23
Ring-Affinity Score
• Score = latency benefit + criticality adjustment
• Criticality adjustment
  – Constant penalty applied to non-critical coherence messages, for simplicity
• Latency benefit = mesh latency estimate − UTL ring latency estimate
• How to get the mesh latency estimate?
  – Depends on the packet’s hop count and on mesh network congestion
    • Tried using just hop count times router latency: not good enough!
  – Small cache in each node stores recent latencies for a given hop count
    • E.g., 8×8 mesh → 15 hop counts → 15 sets in the latency cache
    • Each set keeps the most recently observed latencies
    • Predictor chooses between the most recent latency, the average of the latest few latencies, and the average of all latencies kept in the set
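The per-node latency cache described above could look roughly like the following Python sketch. Everything named here is hypothetical: the class `MeshLatencyCache`, the set size, the "latest few" window, and the fallback of hop count times a fixed router latency. The slides do not say how the predictor picks among its three candidates, so this sketch assumes it picks the candidate with the lowest accumulated absolute error so far.

```python
from collections import defaultdict, deque


class MeshLatencyCache:
    """Recent mesh latencies, kept in one set per hop count (sketch)."""

    def __init__(self, set_size=8, recent=3, fallback_cycles_per_hop=3):
        self.recent = recent
        self.fallback_cycles_per_hop = fallback_cycles_per_hop
        # One bounded set of observed latencies per hop count.
        self.sets = defaultdict(lambda: deque(maxlen=set_size))
        # Accumulated absolute error of each candidate predictor (assumed rule).
        self.errors = defaultdict(lambda: [0.0, 0.0, 0.0])

    def _candidates(self, hops):
        entries = self.sets[hops]
        if not entries:
            return None
        latest = list(entries)[-self.recent:]
        return (
            entries[-1],                 # most recent latency
            sum(latest) / len(latest),   # average of the latest few
            sum(entries) / len(entries), # average of the whole set
        )

    def estimate(self, hops):
        candidates = self._candidates(hops)
        if candidates is None:
            # No history yet: fall back to hop count times router latency.
            return hops * self.fallback_cycles_per_hop
        errors = self.errors[hops]
        best = min(range(3), key=lambda i: errors[i])
        return candidates[best]

    def observe(self, hops, latency):
        # Score each candidate against the new observation, then store it.
        candidates = self._candidates(hops)
        if candidates is not None:
            for i, guess in enumerate(candidates):
                self.errors[hops][i] += abs(guess - latency)
        self.sets[hops].append(latency)
```

A node would call `observe` whenever it learns the actual latency of a mesh packet, and `estimate` when computing the latency benefit for a new packet.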
14/23
Ring Affinity Scoring
• Estimating the UTL ring latency
  – How long to transmit? Easy.
  – How long to get the token?
    • We see everything on the ring!
      – Can remember who sent the last few packets, and when
      – We know how far away the token is (last sender)
      – We can estimate how “fast” it “moves”
    • Example: token moved 7 nodes in 10 cycles (0.7 nodes/cycle)
      – If the token is 30 nodes away, the estimated wait is about 43 cycles (30 / 0.7)
• Detailed equations and explanations are in the paper
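As a rough illustration of the token-wait estimate, the Python sketch below derives the token's apparent speed from the last two senders seen on the ring and divides the remaining distance by that speed. This is a simplification and not the paper's equations. The function names, the 64-node ring size, and the assumption that the token travels toward increasing node indices are all hypothetical.

```python
def token_speed(prev_sender, prev_cycle, last_sender, last_cycle, ring_size):
    """Apparent token speed in nodes/cycle from the last two observed senders."""
    nodes_moved = (last_sender - prev_sender) % ring_size  # d_k
    cycles_taken = last_cycle - prev_cycle                 # t_k
    if cycles_taken <= 0:
        raise ValueError("observations must be in increasing cycle order")
    return nodes_moved / cycles_taken


def estimated_token_wait(my_node, last_sender, speed, ring_size):
    """Estimated cycles until the token reaches my_node (distance / speed)."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    distance = (my_node - last_sender) % ring_size
    return distance / speed


# Core 3 sent at cycle 10 and core 10 sent at cycle 20: 7 nodes in 10 cycles.
speed = token_speed(3, 10, 10, 20, ring_size=64)
# A node 30 positions past the last sender (core 10), i.e. node 40.
wait = estimated_token_wait(40, 10, speed, ring_size=64)
```

Here `speed` is 0.7 nodes/cycle and `wait` is 30 / 0.7, a little under 43 cycles.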
[Figure: example on the ring: Core 3 sent a packet at cycle 10 and Core 10 sent a packet at cycle 20, so d_k = 7 and t_k = 10]
15/23
Threshold and Re-steering
• Threshold adjusted to manage UTL ring utilization
  – Low enough to avoid excessive queuing
  – But high enough not to waste ring throughput
  – Target utilizations around 75% tend to work well
• Threshold Management
  – Packet is steered to the ring when its score exceeds the threshold
  – Increase the threshold when ring utilization is higher than desired
  – Decrease the threshold when ring utilization is too low
• Re-Steering
  – Sudden burst of high-scoring packets…
    • Threshold adaptation takes a while
    • Meanwhile, ring packets have very long latencies
  – If a ring-steered packet sits in the queue too long, re-steer it to the mesh
    • How long is too long?
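The scoring, threshold adjustment, and re-steering rules can be tied together in a small Python sketch. The names, the penalty value, the step size, and the queue-time limit are hypothetical placeholders. In particular, the slides leave "how long is too long" open, so `max_queue_cycles` is only a stand-in.

```python
from dataclasses import dataclass


def ring_affinity_score(mesh_latency_est, ring_latency_est, critical,
                        noncritical_penalty=10.0):
    """Latency benefit plus a constant penalty for non-critical packets."""
    score = mesh_latency_est - ring_latency_est
    if not critical:
        score -= noncritical_penalty  # assumed penalty value
    return score


@dataclass
class ThresholdController:
    threshold: float = 0.0
    target_utilization: float = 0.75  # "around 75%" per the slide
    step: float = 1.0                 # assumed adjustment step
    max_queue_cycles: int = 20        # assumed re-steering limit

    def update(self, measured_utilization):
        """Raise the threshold when the ring is too busy, lower it when idle."""
        if measured_utilization > self.target_utilization:
            self.threshold += self.step
        elif measured_utilization < self.target_utilization:
            self.threshold -= self.step

    def use_ring(self, score):
        """Steer to the ring only when the score exceeds the threshold."""
        return score > self.threshold

    def should_resteer(self, cycles_in_ring_queue):
        """Send a ring-steered packet back to the mesh if it waited too long."""
        return cycles_in_ring_queue > self.max_queue_cycles
```

A fixed step is the simplest choice; a step proportional to the utilization error would adapt faster to bursts, at the cost of more oscillation around the target.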
16/23
Evaluation
• Simulated using SESC
  – 64-tile CMP, 2-issue OoO, 1 GHz, 32 KB L1 D/I cache, 1 MB slice of L2
  – 8×8 mesh (switched NoC) with 128-bit link width, 8 VCs (24 buffers)
• Applications from the PARSEC 3 and SPLASH-2 benchmark suites
  – Half of the applications show <20% improvement with an ideal interconnect
  – Focus analysis on applications sensitive to on-chip latency
[Figure: normalized execution time per application: raytrace, ocean-nc, lu-nc, ocean, streamcls., x264, radiosity, barnes, lu-cn, blacksch., water-sp, cholesky, fft, canneal, volrend, bodytrack, water-nsq, ferret, fmm, radix]
17/23
Speedup
[Figure: speedup per application (barnes, lu-cn, lu-nc, ocean-nc, ocean, radiosity, raytrace, blacksch., x264, streamcls.) and geometric mean; gmean speedup: 1.14×]
18/23
Speedup
[Figure: speedup per application and geometric mean for Mesh+TL, Cmesh+TL, Mesh, and Cmesh]
• 4-concentrated mesh + UTL Ring
  – 8.7% improvement: 1.13× → 1.23×
19/23
Speedup
[Figure: speedup per application and geometric mean for Mesh+TL, Cmesh+TL, Flat+TL, Mesh, Cmesh, and Flat]
• 4-concentrated mesh + UTL Ring
  – 8.7% improvement: 1.13× → 1.23×
• Flattened Butterfly + UTL Ring
  – 5.7% improvement: 1.10× → 1.16×
20/23
Summary
• Increasing core counts worsens on-chip latency
• Unidirectional Transmission Line Ring
  – Low latency
  – But limited throughput
• Use the UTL Ring and the switched interconnect synergistically
  – UTL Ring for low latency
  – Switched interconnect for throughput
• Adaptive traffic steering enables judicious use of the ring
  – Proposed traffic steering provides a 14% performance improvement
21/23
Thank you!
22/23
Result: Latency Reduction of UTL Ring
• UTL Ring latency is 55% lower than the mesh
  – Lower latency than advanced interconnects
  – About 44% latency reduction over concentrated mesh and flattened butterfly
  – But we can only do this for 13% to 44% of messages (2.0% to 9.9% of the bits)
[Figure: normalized packet latency per application and average for Cmesh, Flat, and TL; labeled reductions: 44.3% and 43.9%]
23/23
Result: Speedup vs. Mesh Alone
• Performs slightly better than advanced on-chip networks
  – 1.14× (mesh + UTL ring)
  – vs. 1.13× (concentrated mesh) and 1.10× (flattened butterfly)
[Figure: speedup per application and geometric mean for Cmesh, Flat, and TL; gmean: Cmesh 1.13×, Flat 1.10×, TL 1.14×]
24/23
Adaptive vs. Non-Adaptive Steering
• Non-adaptive random steering
  – 0.63× slowdown on an application (ocean-nc) with high on-chip traffic
  – 1.02× speedup if 30% of packets use the UTL Ring randomly (RND30)
  – 0.96× slowdown if 50% (RND50)
• Adaptive traffic steering
  – 1.14× speedup (up to 1.20× with the 64 Gbps configuration)
[Figure: speedup per application and geometric mean for RND50-16G, RND30-16G, TS-16G, TS-32G, and TS-64G, with a slowdown annotation]