1 An Interconnect-Centric Approach to Cyclic Shifter Design David M. Harris Harvey Mudd College....
-
date post
22-Dec-2015 -
Category
Documents
-
view
214 -
download
1
Transcript of 1 An Interconnect-Centric Approach to Cyclic Shifter Design David M. Harris Harvey Mudd College....
1
An Interconnect-Centric Approach to Cyclic Shifter Design
David M. HarrisHarvey Mudd College.
Haikun Zhu, Yi ZhuC.-K. Cheng
Harvey Mudd College.
2
Outline
• Motivation• Previous Work• Approaches
• Fanout-Splitting• Cell order optimization by ILP
• Conclusions
3
Motivation
• Interconnect dominates gate in present process technology• Delay, power, reliability, process variation, etc.
• Conventional datapath design focuses on logic depth minimization
Source: ITRS roadmap 2005
4
Technology Trends
• Device (ITRS roadmap 2005, Table 40a)
• Updated Berkeley Predictive Interconnect Model
Gate length (nm) 90 65 % decreasing
Inter-layer dielectric constant
2.8 2.6 -
Capacitance of local interconnect (fF/um)
0.186 0.173 7.0%
Gate length (nm) 90 65 % decreasing
Vdd (V) 1.1 1.1 -
Vth (V) 0.195 0.165 -
NMOS gate Cap (fF/m) 0.573 0.469 18.2%
NMOS intrinsic delay (ps)
0.870 0.640 26.4%
5
Shifter Taxonomy
• Functionality• Logical Shift: MSBs stuffed with 0’s• Arithmetic Shift: Extend original MSB• Cyclic Shift (rotation)• Bidirectional Shift
• Circuit Topology• Barrel Shifter• Logarithmic Shifter
6
Barrel Shifter
Sh3Sh2Sh1Sh0
Sh3
Sh2
Sh1
A3
A2
A1
A0
B3
B2
B1
B0
: Control Wire
: Data Wire
BufferSh3Sh2Sh1Sh0
A3
A2
A1
A0
• Pros• Every data signal pass only one transmission gate
• Cons• Input capacitance is • # transistors =• Requires additional decoder for control signals
Schematic
layout
7
Logarithmic Shifter
Sh1 Sh1 Sh2 Sh2 Sh4 Sh4
A3
A2
A1
A0
B1
B0
B2
B3
Schematic
layout
• Pros• # transistors =
• Cons• Long inter-stage wires, especially for cyclic shiftersTarget of Optimization
8
Cyclic Shifter -- Applications• Finite Field Arithmetic
• In normal basis, squaring is done by cyclic shifting.• Encryption
• ShiftRows operation in Rijndael algorithm.• DCT processing unit
• Address generator• Bidirectional shifting
• Can be implemented as a cyclic shifter with additional masking logic
• CORDIC algorithm• etc …
9
Previous Work
• Bit interleave• Two dimensional folding strategy• Gate duplicating• Ternary shifting• Comparison between barrel shifter and
log shifter
10
Cyclic Shifter – Traditional Design
• MUX-based
S2
S1
S0
D7 D6 D5 D4 D3 D2 D1 D0
Z7 Z6 Z5 Z4 Z3 Z2 Z1 Z0
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
(7,1) (6,1) (5,1) (4,1) (3,1) (2,1) (1,1) (0,1)
(7,2) (6,2) (5,2) (4,2) (3,2) (2,2) (1,2) (0,2)
(7,3) (6,3) (5,3) (4,3) (3,3) (2,3) (1,3) (0,3)
(7,0) (6,0) (5,0) (4,0) (3,0) (2,0) (1,0) (0,0)
• Shifting & non-shifting paths are intertwined together.
• Wire load on the critical path is
Timing
Power
• When configured to pass through, the non-shifting paths have to be switched as well.
11
Fanout Splitting Shifter
• Use DEMUXes instead of MUXes
Z7 Z6 Z5 Z4 Z3 Z2 Z1 Z0
D7 D6 D5 D4 D3 D2 D1 D0
S2
S1
S0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
(7,1) (6,1) (5,1) (4,1) (3,1) (2,1) (1,1) (0,1)
(7,2) (6,2) (5,2) (4,2) (3,2) (2,2) (1,2) (0,2)
(7,3) (6,3) (5,3) (4,3) (3,3) (2,3) (1,3) (0,3)
(7,0) (6,0) (5,0) (4,0) (3,0) (2,0) (1,0) (0,0)
Power
• When shifting, non-shifting paths are at rest.
• Shifting & non-shifting paths are now decoupled.
• Wire load on the longest path is
Timing
12
Example
D7 D6 D5 D4 D3 D2 D1 D0
S2=1
S1=0
S0=1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
(7,1) (6,1) (5,1) (4,1) (3,1) (2,1) (1,1) (0,1)
(7,2) (6,2) (5,2) (4,2) (3,2) (2,2) (1,2) (0,2)
(7,3) (6,3) (5,3) (4,3) (3,3) (2,3) (1,3) (0,3)
(7,0) (6,0) (5,0) (4,0) (3,0) (2,0) (1,0) (0,0)
D0 D7 D6 D5 D4 D3 D2 D1
D0 D7 D6 D5 D4 D3 D2 D1
D0 D7 D6 D5D4 D3 D2 D1
Right rotate 5 bits
Red lines are signal linesGreen lines are quiet lines
13
Dynamic Power Consumption
• Dynamic Power• Switching Probability
S2
S1
S00 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
(7,1) (6,1) (5,1) (4,1) (3,1) (2,1) (1,1) (0,1)
(7,2) (6,2) (5,2) (4,2) (3,2) (2,2) (1,2) (0,2)
(7,3) (6,3) (5,3) (4,3) (3,3) (2,3) (1,3) (0,3)
(7,0) (6,0) (5,0) (4,0) (3,0) (2,0) (1,0) (0,0)
MUX based design
DEMUX based design
SP= 1/4
SP= 3/16
SP = Switching Probability
S2
S1
S0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
(7,1) (6,1) (5,1) (4,1) (3,1) (2,1) (1,1) (0,1)
(7,2) (6,2) (5,2) (4,2) (3,2) (2,2) (1,2) (0,2)
(7,3) (6,3) (5,3) (4,3) (3,3) (2,3) (1,3) (0,3)
(7,0) (6,0) (5,0) (4,0) (3,0) (2,0) (1,0) (0,0)
14
Gate Complexity
• Re-factoring design• No extra complexity at gate level,
both are
MUX-based DEMUX-based
15
DualityNAND gates network NOR gates network
• Duality provides flexibility for low level implementation• NAND gates are good for static CMOS.• NOR gates are good for dynamic circuits.
16
Cell Permutation
• Datapath usually assumes bit-slice structure• The cell order of the input/output stages must be
fixed• However, the cells in the intermediate stages are
free to permute.7 6 5 4 3 2 1 0
7 6 5 4 3 2 1 0
7 6 5 4 3 2 1 0
7 6 5 4 3 2 1 0
>> 1-bit
>> 2-bit
>> 4-bit
fixed
fixed
free
17
Problem Statement
• Given• A N-bit rotator• Fixed linear order of the input/output stages
• Find• An optimal permutation scheme of the inter
mediate stages such that the longest path is minimized (or, the total wire length s.t. delay constraint).
18
ILP Formulation
• Introduce a set of binary decision variables • if and only if logic cell is at
physical location on level
• The solution space is fully defined by constraints
19
ILP Formulation (cont’)
• Minimum delay formulation
• Minimum power formulation
Which can be expanded into
objective
20
ILP Formulation (cont’)
• Represent the length of a single wire segment
• Formulating absolute operation
Psuedo-linear constraints discarded because we’re trying to minimize
21
Complexity
• Minimum total wire length formulation• The case of one level of free cells is a minimum
weight bipartite matching problem
• For the case of two or more levels of free cells, optimal polynomial algorithm is unknown
• Hardness of minimum delay formulation un-established.
7 6 5 4 3 2 1 0
7 6 5 4 3 2 1 0
7 6 5 4 3 2 1 0
>> 1-bit
>> 2-bit
Logic index
Physical location
. . .
. . .
. . .
7
6
5
4
3
2
1
0
7
6
5
4
3
2
1
0
22
ILP Complexity• The ILP formulation does not scale well
• Both #integer variables and #constraints are • CPLEX uses on branch & bound – exponential growth
• Sliding window scheme• Only cells in the window are allowed to permute• Consists of multiple passes; terminate when there is no
improvement between passes
WW = 8 columnsWH = 3 rowsHS = 4 columnsVS = 1 column
23
Power & Delay Evaluation
• Overall Flow
24
Optimal solution for 8-bit case
7 6 5 4 3 2 1 0
7 6 5 4 3 2 1 0
7 6 5 4 3 2 1 0
7 6 5 4 3 2 1 0
>> 1-bit
>> 2-bit
>> 4-bit
0 0 00 0 0 0 0
0 0 00 0 0 0 0 11111117
11111117
7 7 1 7 1 9 1 3 1 3 1 3 1 3 1 3
7 7 9 3 3 3 3 3
7 7 7 7 9 7 3 7 3 11 3 11 3 13 3 7
7 7 9 7 11 11 13 7
7 6 5 4 3 2 1 0
6 5 4 3 7 2 1 0
3 4 2 6 7 5 1 0
7 6 5 4 3 2 1 0
>> 1-bit
>> 2-bit
>> 4-bit
0 0 00 0 0 0 0
1 1 01 1 4 0 0 11130000
11141111
4 2 2 2 4 1 4 5 4 3 5 5 1 4 1 3
4 2 4 5 4 5 4 3
8 4 7 5 8 8 4 7 8 4 7 7 4 6 3 8
8 7 8 7 8 7 6 8
A global optimal solution
25
16-bit and 32-bit cases16-bit, global optimal solution
32-bit, suboptimal solution by sliding window method
26
Results
27
Power-Delay Tradeoff
8-bit 16-bit
32-bit 64-bit
• Given Tmax constraint, optimize Ttotal
8-bit & 16-bit are global optimum by cplex
32-bit & 64-bit are suboptimal result by sliding window scheme
28
Outline
• Motivation• Previous Work• Approaches
• Fanout-Splitting• Cell order optimization by ILP
• Conclusions and future work
29
Conclusions & Future Work• We have proposed
• Fanout-splitting design• ILP based layout optimization
• Future directions• Extend the fanout splitting idea and ILP formulation to ternary shifter• Try alternative hierarchical approach to tackle the ILP complexity issu
e31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
30
Thank you!
The End
31
Derivation of switching probability
Gate level implementation ofMUX and DEMUX
Truth table
For DEMUX
For MUX
32
Evaluating interconnect effect
• Based on logical effort model• Technology independent• Easy to incorporate interconnect
effectGate delay
Logical effort, only depends on gate typeElectrical effort, depends on load cap
Parasitic delay, only depends on gate typeWire load is integrated into h
Electrical effort contributed by wire per column spanned Wire length normalized to cell width
33
Deciding
• Evaluate for a set of technology nodes
• It is safe to assume hw=1