Low-Latency FIFO’s Using Token Rings
Tiberiu Chelcea, Steven M. Nowick
Columbia University, New York, USA
Introduction (1)
Contributions
• Two novel FIFO designs:
– Circular buffer of identical cells
– Distributed control
– Common buses
– Token passing: 2 tokens control I/O behavior
– No data movement
• Very low latency in an empty FIFO
• Still maintain high throughput
Introduction
Two FIFO Protocols:
• Basic: simple, non-overlapped write/read to a cell
• Optimized: overlapped write/read to a cell
– more concurrency per cell
– various low-level optimizations:
• “early drive” of receiver’s data bus
• single-wire signaling, etc.
3 implementations of basic, 1 of optimized
HSpice simulations
Related Work (1)
Related Work
• Most FIFO’s targeted to high throughput:
– poor latency
– data movement
• One solution: modify structure to obtain lower latency [Brunvand95]
– types: folded, tree, square
– drawbacks:
• data still moved
• latency proportional to # of stages
• complex critical paths
Related Work (2)
Low-Latency FIFO’s
Commonly implemented as circular buffers
– no data movement
1. Centralized Control [Sutherland89, Yakovlev95]
Limitations:
– complex centralized counters for head/tail positions
– overhead: delay/area (including arbiters!)
2. Distributed Control [Yakovlev89, Kishinevsky93]
Limitations:
– no overlapped put/get to same cell (unlike ours)
– significant latencies (e.g. 3-stage delay)
Two closer approaches presented later on…
Summary (1)
Overview of the Talk
• Basic FIFO:
– Basic Protocol
– Implementation
• Optimized FIFO:
– Optimized Protocol
– Implementation
– Related Work
• Results
• Conclusions
Basic Protocol (1)
FIFO Interface
• Interfaces to two environments:
– sender communicates on put port
– receiver communicates on get port
• FIFO allows concurrent puts (writes) and gets (reads)
[Figure: FIFO block with a put port and a get port]
Basic Protocol (2)
FIFO Architecture
• FIFO = replicated cells + starter (circular buffer)
• put and get ports each consist of a common bus (data + control)
• Two tokens in FIFO: put token and get token
– “Starter” cell places tokens in circulation
– no data movement
• When full, every cell contains data (capacity N)
[Figure: ring of four cells plus a Starter cell, all attached to the common put and get buses]
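The architecture above can be modeled behaviorally in a few lines. This Python sketch is only an illustration under stated assumptions: the class name `TokenRingFifo`, the per-cell `valid` flags, and the use of list indices for the token positions are hypothetical and not taken from the slides, and the direction in which the indices advance is arbitrary. It shows the key property: data stays in the cell where it was written, and only the two tokens move.

```python
class TokenRingFifo:
    """Behavioral model (hypothetical): N cells in a ring, data never moves."""

    def __init__(self, capacity):
        self.cells = [None] * capacity    # one register per cell
        self.valid = [False] * capacity   # True while a cell holds unread data
        self.put_token = 0                # index of the cell holding the put token
        self.get_token = 0                # index of the cell holding the get token

    def put(self, item):
        """Write into the cell holding the put token; False if that cell is full."""
        i = self.put_token
        if self.valid[i]:
            return False                  # FIFO full: the put request stays pending
        self.cells[i] = item
        self.valid[i] = True
        self.put_token = (i + 1) % len(self.cells)   # pass the put token on
        return True

    def get(self):
        """Read from the cell holding the get token; returns (ok, item)."""
        i = self.get_token
        if not self.valid[i]:
            return (False, None)          # FIFO empty: the get request stays pending
        item = self.cells[i]
        self.valid[i] = False
        self.get_token = (i + 1) % len(self.cells)   # pass the get token on
        return (True, item)


fifo = TokenRingFifo(4)
accepted = [fifo.put(x) for x in range(5)]   # the fifth put finds the FIFO full
first = fifo.get()
```

In this model the latency of an empty FIFO does not depend on the capacity, because an item written into a cell can be read from that same cell as soon as the get token is there.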
Basic Protocol (3)
FIFO Simulation 1: Start
[Figure: ring at start-up with the put token (P) and get token (G). Annotations: “Put token requested”, “Get token requested”. Signals shown: put_req, valid.]
Basic Protocol (4)
FIFO Simulation 2: Steady-State Operation
[Figure: animation frames of the ring in steady state. The put token (P) and get token (G) move from cell to cell as put_req and get_req arrive; cells holding data are marked valid.]
Basic Protocol (4)
FIFO Simulation 3: Full
[Figure: ring with every cell marked valid. Annotations: “Put token requested”, “Put request acknowledged”, “Put token not passed: next cell not ready”, “put_req: pending”. Signal shown: get_req.]
Basic Protocol (6)
Basic Cell Protocol
Pseudo-code Program:
forever {
  ObtainPutToken
  EnqueueData
  PassPutToken
  ObtainGetToken
  DequeueData
  PassGetToken
}
CSP Program:
FIFO = *[[ right; [put → put?x]; [left → left]; right; [get → get!x]; [left → left] ]]
[Figure: cell with ports put, get, left, and right; labels do_put and do_get]
Basic Protocol (7)
Cell’s Handshake Behavior
• Port Activity:
– put & get: passive
– right: active
– left: passive
• Channel Implementation:
– 4-phase handshaking
– bundled data: put and get
– validity scheme [Peeters96]:
• get: “middle data validity” (ack+ req-)
• put: early data validity (req+ ack+)
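The two validity schemes above can be read as windows within one 4-phase handshake cycle. The sketch below assumes that each parenthesized pair names the events that open and close the interval in which the bundled data must be valid; that reading, and the names `FOUR_PHASE`, `VALIDITY`, and `valid_window`, are assumptions made for illustration only.

```python
# One 4-phase (return-to-zero) handshake cycle on a channel.
FOUR_PHASE = ("req+", "ack+", "req-", "ack-")

# Assumed reading of the slide: (opening event, closing event) per port.
VALIDITY = {
    "put": ("req+", "ack+"),   # early data validity
    "get": ("ack+", "req-"),   # "middle data validity"
}


def valid_window(port):
    """Return the handshake events spanned by the data-valid interval."""
    first, last = VALIDITY[port]
    i = FOUR_PHASE.index(first)
    j = FOUR_PHASE.index(last)
    return FOUR_PHASE[i:j + 1]


put_window = valid_window("put")
get_window = valid_window("get")
```

Under this reading, put data may be withdrawn once the cell acknowledges, while get data only needs to be stable from the acknowledge until the receiver withdraws its request.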
Basic Protocol Implementation (1)
Basic Cell Implementation #1: Tangram
“Starter Cell”: can also be implemented using Tangram
Tangram program:
proc cell (put?T & get!T & right & left)
begin
  x: T
  forever do
    right; put?x; left;
    right; get!x; left;
  od
end
[Figure: handshake circuit for the cell, with components labeled MUX, REG, “;” and “#”, connected to the get, put, left, and right ports]
Basic Protocol Implementation (2)
Basic Cell Implementation #2: Petrify
[Figure: signal transition graph (STG) of the cell, with a put half and a get half, over the signals right_req, right_ack, put_req, put_ack, get_req, get_ack, left_req, and left_ack]
Basic Protocol Implementation (3)
Basic Cell Implementation #3: Burst-Mode
Decomposed into several communicating BM machines:
• Put/Get Controllers: handle put/get ports
• Left Controller: passes tokens to left
• Token Distributor: controls token flow to the three controllers
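[Figure: basic cell block diagram: Put Controller, Get Controller, Left Controller, Token Distributor, and REG, linked by the ptok, gtok, and pass channels, with the put, get, left, and right ports]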
Basic Protocol Implementation (4)
Put Controller
• Synchronizes handshakes on put and ptok channels:
– put: environmental request
– ptok: put token is in cell
• If cell has token (ptok_r+): cell does put operation
• If no token, no put: put_req+/- = partial input burst => ignored
put_req+ ptok_r+/put_ack+ ptok_a+
put_req- ptok_r-/put_ack- ptok_a-
[Figure: basic cell block diagram (Put, Get, and Left Controllers, Token Distributor, REG)]
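The Put Controller specification above fires its outputs only once the whole input burst has arrived. The following Python sketch mimics that behavior at the level of signal values; the class `PutControllerModel` and its method are hypothetical names, and the model ignores all circuit-level timing.

```python
class PutControllerModel:
    """Value-level model (hypothetical) of the two-arc burst-mode specification."""

    def __init__(self):
        self.put_req = False
        self.ptok_r = False
        self.put_ack = False
        self.ptok_a = False

    def set_inputs(self, put_req=None, ptok_r=None):
        """Update any subset of the inputs and return (put_ack, ptok_a)."""
        if put_req is not None:
            self.put_req = put_req
        if ptok_r is not None:
            self.ptok_r = ptok_r
        if self.put_req and self.ptok_r:
            # Input burst {put_req+, ptok_r+} complete: raise both outputs.
            self.put_ack = self.ptok_a = True
        elif not self.put_req and not self.ptok_r:
            # Input burst {put_req-, ptok_r-} complete: lower both outputs.
            self.put_ack = self.ptok_a = False
        # Otherwise the burst is only partial and the outputs hold their value.
        return (self.put_ack, self.ptok_a)


pc = PutControllerModel()
trace = [
    pc.set_inputs(put_req=True),    # request on the bus, but no token in this cell
    pc.set_inputs(put_req=False),   # request withdrawn: partial burst, ignored
    pc.set_inputs(ptok_r=True),     # token arrives
    pc.set_inputs(put_req=True),    # burst complete: the cell performs the put
    pc.set_inputs(put_req=False),   # half of the reset burst
    pc.set_inputs(ptok_r=False),    # reset burst complete
]
```

Because put_req is a common bus signal seen by every cell, only the cell holding the token ever completes the burst; the others see put_req rise and fall with no effect.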
Basic Protocol Implementation (7)
Token Distributor
• Receives tokens from right channel
• Distributes tokens to Put and Get Controllers, respectively
• Passes tokens to Left Controller
[Figure: basic cell block diagram (Put, Get, and Left Controllers, Token Distributor, REG)]
[Figure: burst-mode specification of the Token Distributor, with per-channel fragments for the right, pass, ptok, and gtok channels. Transitions shown include right_ack+/right_req-, right_ack-/ptok_r+, ptok_a+/ptok_r-, ptok_a-/pass_r+, pass_a+/pass_r-, pass_a-/right_req+, right_ack-/gtok_r+, gtok_a+/gtok_r-, and gtok_a-/pass_r+.]
Basic Protocol Implementation (8)
Token Distributor: Burst-Mode Implementation
• Synthesized with the MINIMALIST CAD Package [Fuhrer, Nowick et al., 99]
• Optimized for speed
[Figure: gate-level Token Distributor circuit with inputs pass_a, gtok_a, ptok_a, ra; outputs pass_r, nrr, ptok_r, gtok_r; state variables y0, y1, y2]
Optimized Protocol (1)
Problems with Basic Protocol
No “Program-Level Parallelism”:
• no overlapped write/read to same cell
• large latency
• poor throughput
• two tokens “multiplexed” onto single channels
Limited Low-Level Optimizations:
• “late enable” of get data bus
• handshake overheads
• limited fine-grained concurrency
Basic Protocol: Sequential Program
Actions strictly sequential
Latency (3 actions):
– EnqueueData
– PassPutToken
– ObtainGetToken [DequeueData]
Throughput (3 actions):
– ObtainPutToken
– EnqueueData
– PassPutToken
[Figure: the six actions in sequence (ObtainPutToken, EnqueueData, PassPutToken, ObtainGetToken, DequeueData, PassGetToken); tokens arrive from the right cell and are passed to the left cell]
Optimized Protocol (2)
Optimized Protocol: Concurrent Program
Token passing: off critical paths
Latency: 1 action
Throughput: 2 actions
Further low-level optimizations:
– effectively improve throughput to 1 action
[Figure: ObtainPutToken, then EnqueueData alongside PassPutToken; ObtainGetToken, then DequeueData alongside PassGetToken; tokens arrive from the right cell and are passed to the left cell]
Optimized Protocol (3)
Architectural Modifications
• Tokens passed on two separate channels
• One cell can hold both tokens simultaneously:
– allows overlapped writes and reads
• get token may be briefly ahead of the put token!
• No explicit “Starter” cell
[Figure: ring of four cells with separate Put and Get token channels]
Optimized Protocol Implementation (1)
Optimized Cell Architecture
• ObtainPutToken: receives put token
• ObtainGetToken: receives get token
• PutController: handles communication on put channel
• GetController: handles communication on get channel
• DataValid: indicates the validity of REG contents
[Figure: optimized cell block diagram: ObtainPutToken, ObtainGetToken, PutController, GetController, DataValid, and REG, with the signals we, re, we1, re1 and the put and get ports]
Optimized Protocol Implementation (2)
Optimized Cell Implementation
• OPT/OGT: Burst-mode machines
• DV: uses relative timing (synthesized using Petrify)
• PC/GC: asymmetric C-elements
• Optimizations:
– “early data out” enabling
– single-wire token passing
[Figure: optimized cell circuit: OPT, OGT, PC, GC, DV, and REG, with asymmetric C-elements; signals put_req, put_data, put_ack, get_req, get_ack, get_data, we, re, we1, re1]
Optimized Protocol Implementation (3)
Enqueuing Data
• put token received on we1 (single wire)
• we+ (when request & token) triggers:
– latching data
– start passing put token
– resetting OPT
ObtainPutToken (burst-mode specification):
we1+/
we1-/ptok+
we+/ptok-
we-/
[Figure: optimized cell circuit (OPT, OGT, PC, GC, DV, REG), with the ptok signal shown]
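The four ObtainPutToken arcs listed above can be exercised as a small table-driven machine. The sketch assumes the arcs form a single cycle in the order listed, which the flattened slide does not confirm; `OPT_SPEC` and `run_opt` are hypothetical names.

```python
# (input event, output events) per arc, assumed to form one cycle in this order.
OPT_SPEC = [
    ("we1+", []),          # single-wire token pulse from the neighboring cell rises
    ("we1-", ["ptok+"]),   # pulse falls: this cell now holds the put token
    ("we+",  ["ptok-"]),   # write enable fires (request and token): release the token
    ("we-",  []),          # write enable falls: OPT is back in its initial state
]


def run_opt(events):
    """Replay input events against the spec and collect the output events."""
    state = 0
    outputs = []
    for event in events:
        expected, produced = OPT_SPEC[state]
        if event != expected:
            raise ValueError(f"unexpected {event!r}, spec expects {expected!r}")
        outputs.extend(produced)
        state = (state + 1) % len(OPT_SPEC)
    return outputs


one_cycle = run_opt(["we1+", "we1-", "we+", "we-"])
```

The point of the single-wire pulse is that receiving the token needs no acknowledge back to the sender, so the token transfer involves no 4-phase handshake.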
Optimized Protocol Implementation (5)
Data Valid
Asymmetric protocol:
– data valid: in the active phase of put (we+)
– data invalid: in the return-to-zero (RZ) phase of get (re-)
– avoids overwrite by next put
[Figure: DataValid signal transition graph over we+, we-, re+, re-, valid+, valid-]
[Figure: optimized cell circuit (OPT, OGT, PC, GC, DV, REG)]
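The asymmetry described above, set on the rising edge of we and cleared on the falling edge of re, can be sketched as an edge-sensitive flag. The class `DataValidModel` is a hypothetical value-level model, not the Petrify-synthesized circuit.

```python
class DataValidModel:
    """Hypothetical model: valid+ follows we+, valid- follows re-."""

    def __init__(self):
        self.we = False
        self.re = False
        self.valid = False

    def set_we(self, level):
        if level and not self.we:      # rising edge we+: active phase of the put
            self.valid = True
        self.we = level

    def set_re(self, level):
        if self.re and not level:      # falling edge re-: RZ phase of the get
            self.valid = False
        self.re = level


dv = DataValidModel()
history = []
for signal, level in [("we", True), ("we", False), ("re", True), ("re", False)]:
    if signal == "we":
        dv.set_we(level)
    else:
        dv.set_re(level)
    history.append(dv.valid)
```

Keeping the cell marked valid until re- means the register cannot be overwritten while the receiver is still reading it.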
Optimized Protocol Implementation (5)
Early Enable: Get Data Bus
Early Enable = get token in cell
Late Enable = get token + get request
• Extra slack to meet bundling constraints
[Figure: optimized cell circuit (OPT, OGT, PC, GC, DV, REG)]
Optimized Protocol Implementation (6)
Timing Constraints 1. Pulse-Width Requirements
• 2 pulse width constraints
• re and we: race between:
– state change
– environment path
• easily met
• DV synthesized using Petrify (“slowenv” option)
[Figure: optimized cell circuit (OPT, OGT, PC, GC, DV, REG)]
Optimized Protocol Implementation (7)
Timing Constraints 2. Bundling Constraint
• Get operation: get_ack must indicate valid data
• Bundling constraint: get_data faster than get_ack+
• Moderate size FIFO’s: easy to meet
• Very large FIFO’s: padded delays on control
• “Early drive” of get_data alleviates the problem: extra slack
[Figure: optimized cell circuit (OPT, OGT, PC, GC, DV, REG)]
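The bundling constraint above is a simple inequality between two path delays. The sketch below shows how slack and the required control padding could be computed; the function names and every delay value are made up for illustration and are not measurements from the talk.

```python
def bundling_slack(data_delay_ns, ack_delay_ns):
    """Positive slack: get_data is stable before get_ack+ reaches the receiver."""
    return ack_delay_ns - data_delay_ns


def padding_needed(data_delay_ns, ack_delay_ns, margin_ns=0.0):
    """Delay to add on the control path so that the constraint holds with margin."""
    return max(0.0, data_delay_ns + margin_ns - ack_delay_ns)


# Hypothetical numbers only. With a late enable the bus is driven after the
# get request arrives, so the data path is long compared with the ack path.
late_enable_padding = padding_needed(data_delay_ns=1.5, ack_delay_ns=1.0)

# With an early drive the bus is already driven when the request arrives,
# modeled here simply as a shorter effective data delay.
early_drive_slack = bundling_slack(data_delay_ns=0.75, ack_delay_ns=1.0)
early_drive_padding = padding_needed(data_delay_ns=0.75, ack_delay_ns=1.0)
```

Any padding added to get_ack costs get throughput directly, which is why the extra slack from driving the bus early matters for large FIFO’s with heavily loaded buses.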
Related Work (3)
Related Work: Close Approaches
• Two Designs [Yi95, Chu86]:
– use: circular arrays, common data buses, token passing
• “Word-Slice FIFO” [Yi95]:
– worse throughput for get than ours (10 gates vs. 6)
– tighter bundling constraints: uses “late read enable”
• FIFO for Packet Networks [Chu86]:
– worse throughput for put than ours (6 block delays vs. 4)
– tighter bundling constraints: uses “late read enable”
Results (1)
Results
• HSpice simulations: 0.6µ HP CMOS, 3.3V, 27°C
• Word size: 8 bits
• Buses modeled carefully:
– wire lengths, load
– attached capacitance
• Various experiments:
– FIFO capacity (4- vs. 16-place)
– environmental latency (slow vs. fast)
Results (1)
Results: Latency
Latency (ns)
            Basic                                                 Optimized
            Tangram   Petrify         Burst-Mode
                      (centralized)   (distributed)
FIFO 4-S    13.76     12.54           7.94                        1.73
FIFO 4-F    13.75     12.54           7.81                        1.73
FIFO 16-S   14.32     13.01           8.52                        2.30
FIFO 16-F   14.13     12.90           8.41                        2.29
S = slow environment, F = fast environment
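To make the latency comparison concrete, the ratios between each basic implementation and the optimized design can be computed directly from the table values. The script below only does that arithmetic; the dictionary layout and the function name `speedup` are illustrative choices.

```python
# Latency values in ns, copied from the latency table.
latency_ns = {
    "FIFO 4-S":  {"Tangram": 13.76, "Petrify": 12.54, "Burst-Mode": 7.94, "Optimized": 1.73},
    "FIFO 4-F":  {"Tangram": 13.75, "Petrify": 12.54, "Burst-Mode": 7.81, "Optimized": 1.73},
    "FIFO 16-S": {"Tangram": 14.32, "Petrify": 13.01, "Burst-Mode": 8.52, "Optimized": 2.30},
    "FIFO 16-F": {"Tangram": 14.13, "Petrify": 12.90, "Burst-Mode": 8.41, "Optimized": 2.29},
}


def speedup(config, baseline):
    """Latency of a basic implementation divided by the optimized latency."""
    row = latency_ns[config]
    return row[baseline] / row["Optimized"]


for config in latency_ns:
    ratios = ", ".join(
        f"{name} {speedup(config, name):.2f}x"
        for name in ("Tangram", "Petrify", "Burst-Mode")
    )
    print(f"{config}: {ratios}")
```

For the 4-place FIFO in the slow environment, for example, this gives a ratio of about 7.95 against the Tangram design and about 4.59 against the burst-mode design.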
Results (2)
Results: Throughput
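Throughput (MegaOps/s); each design has a put column and a get column
[Table flattened in extraction: the two value sets below are the put and get columns, but which set is which, and the row labels of the second set, did not survive]
            Basic                                                 Optimized
            Tangram   Petrify         Burst-Mode
                      (centralized)   (distributed)
FIFO 4-S    185       200             200                         404
FIFO 4-F    185       200             204                         423
FIFO 16-S   175       190             191                         335
FIFO 16-F   179       195             192                         359
Second value set (rows in extraction order, labels lost):
            162       202             167                         367
            161       196             164                         348
            162       216             175                         454
            161       208             172                         427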
Conclusions (1)
Conclusions
• Presented novel FIFO designs
• Two protocols: basic, optimized
– circular buffers
– common buses
– token passing
• Very low latency achieved by protocol manipulation
• Maintain high throughput
• Potential for low power: no data movement
Basic Protocol (5)
FIFO Behavior: Empty
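[Figure: empty ring with the Starter cell and the Put and Get buses, showing the positions of the put token (P) and get token (G)]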
Basic Protocol Implementation (5)
Get Controller
• Triggered by (i) a get request and (ii) the get token
• Synchronizes handshaking on the get and gtok channels
• If no token, only get_req+ can arrive = partial input burst
• If token (gtok_r+), then get_req+ completes the input burst
get_req+ gtok_r+/get_ack+ gtok_a+
get_req- gtok_r-/get_ack- gtok_a-
[Figure: basic cell block diagram (Put, Get, and Left Controllers, Token Distributor, REG)]
Basic Protocol Implementation (6)
Left Controller
• Waits for a request for tokens and their availability
• Completes handshaking on both the left and pass channels
left_req+ pass_r+/left_ack+ pass_a+
left_req- pass_r-/left_ack- pass_a-
[Figure: basic cell block diagram (Put, Get, and Left Controllers, Token Distributor, REG)]
Introduction (2)
Overview of Approach
• The FIFO interfaces two environments
• Circular structure of identical cells
• Cells connected to common data and control buses
• Two tokens dictate the I/O behavior:
– put token selects the input cell
– get token selects the output cell
• Once enqueued, data is not moved until it is dequeued; hence the potential for low latency
Introduction
• Distributed control
• Circular buffer of identical cells
• Common buses: all cells communicate on them
• Token passing determines the I/O behavior
• FIFO allows concurrent reads/writes
• When full, every cell contains data (capacity N)
Basic Protocol Implementation (4)
Put Controller
• If token (ptok_r+), cell does the put operation
• If no token, no put:
– put_req+/put_req- are partial input bursts => ignored
– burst-mode implementation handles this behavior
put_req+ ptok_r+/put_ack+ ptok_a+
put_req- ptok_r-/put_ack- ptok_a-
[Figure: basic cell block diagram (Put, Get, and Left Controllers, Token Distributor, REG)]
Optimized Protocol Implementation (4)
Dequeuing Data
• get token received on re1: single wire
– no 4-phase handshaking
ObtainGetToken (burst-mode specification):
re1+/
re1-/gtok+
re+/
re-/gtok-
[Figure: optimized cell circuit (OPT, OGT, PC, GC, DV, REG)]
Optimized Protocol Implementation (4)
Dequeuing Data
• When there is a get request, generate re to:
– start passing the get token
– ack the receiver
– start resetting OGT
ObtainGetToken (burst-mode specification):
re1+/
re1-/gtok+
re+/
re-/gtok-
[Figure: optimized cell circuit (OPT, OGT, PC, GC, DV, REG)]
Optimized Protocol Implementation (3)
Enqueuing Data
• When there is a put request, generate we to:
– latch data
– start passing put token
– reset OPT
ObtainPutToken (burst-mode specification):
we1+/
we1-/ptok+
we+/ptok-
we-/
[Figure: optimized cell circuit (OPT, OGT, PC, GC, DV, REG), with the ptok signal shown]