Clock Distribution
description
Transcript of Clock Distribution
![Page 1: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/1.jpg)
July 2010 1
Clock Distribution
Shmuel WimerBar Ilan Univ. Eng. Faculty
Technion, EE Faculty
![Page 2: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/2.jpg)
July 2010 2
Clock System Architecture
GatersBuffers
Clocked Elements
Clock Distribution
Clock Generator
Chip
clk1 clk2
clk3
External Clock
gclkext_clk
Clock generator adjusts the global clock to the external clock.
Global clock is distributed across the chip.
Chip receives external clock through I/O pad.
Local drivers and gaters drive the physical clocks to clocked elements.
![Page 3: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/3.jpg)
July 2010 3
Global Clock Generation
• Receives external clock signal and produce the global
clock distributed across the die.
• A large skew occurs between external clock and the
physical clocks at clocked elements due to delay of
distribution network (wires, buffers, gaters).
• Therefore, data at clocked elements is no more in sync
with data at I/O pins.
• Phased Locked Loop (PLL) compensates this delay.
• PLL can perform frequency multiplication to obtain the
required on-chip frequencies.
![Page 4: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/4.jpg)
July 2010 4
Din
Dout
CLKinext_clk
PLL
ref_clk
fdbk_clk
clk_out
Synchronous Chip Interface with PLL
Din
Dout
CLKout
Chip A Chip B
Chip A communicates synchronously with chip B
Chip B uses the clock sent by chip A. Data in and out must be synchronized to the common clock.
A PLL produces the global clock of chip B such that it is in sync with the external clock.
clkgclk
Clock Distribution
![Page 5: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/5.jpg)
July 2010 5
How PLL Works?
Voltage Controlled Oscillator
clk_out
ref_clk
fdbk_clk
Phase Detect
Up
DownI
I
Charge Pump
C
R
Vctrl
Loop Filter
M
N
![Page 6: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/6.jpg)
July 2010 6
Phase – Frequency Detector (PFD)
Once the lagging signal arrives, a reset turns both Q_A and Q_B to zero.
D Q
CLR
D Q
CLR
Q_A: B should go faster
A
1
1 Q_B: B should go slower
B
The two flip-flops receive the signals at their clock input (one is usually a
reference and the other is the sampled).
The output of the leading flip-flop is 1 for the lead duration.
![Page 7: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/7.jpg)
July 2010 7
A: reference
B: sampled
Q_A: sampled should go faster
The spikes at Q_B are a result of the delay of the AND gate driving the
CLR input of flip-flip and the internal delay from CLR to Q.
What happens when the reference and the sampled signals are a shift of each other?
Q_B: sampled should go slower
![Page 8: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/8.jpg)
July 2010 8
A: reference
Sampled is more often 1-value than the reference is, since rising edge of
B occurs more often than rising edge of A.
What happens when the reference and the sampled signals have different frequencies?
B: sampled
Q_B: sampled should go slower
Q_A: sampled should go faster
![Page 9: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/9.jpg)
July 2010 9
Charge Pump
D Q
CLR
D Q
CLR
CLK_ref
1
1
CLK_fdbk
Vctrl
faster
slower
Icp
Icp
Sup
Sdn
Converts PFD error (digital) to charge (analog), which then controls PLL VCO.
cp up faster dn slowerCharge is proportional to PFD widths: .Q I t I t
![Page 10: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/10.jpg)
July 2010 10
Charge pump consists of current mirrors which are sources of constant
current. Device N1 is in saturation since its gate is connected to high voltage.
Ids (=Iin) depends only on Vgs. Vgs is similar in N2, hence Iout=Iin.
This is an ideal current source with infinite output impedance since Iout is
independent of N2 load; a change in output voltage doesn’t affect Iout.
Iin IoutV
N1 N2
Vss
Iin IoutV
P1 P2
Vcc
Current mirror works similarly for P transistors.
Current Mirror
![Page 11: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/11.jpg)
July 2010 11
Vss
Vcc
faster
slower
Vcc
R
C
Vout
Switch – open when faster = 1
Switch – open when slower = 1
How Charge Pump Works?
P-type current mirror
II
N-type current mirror
III
I
R load – determines the current through current mirror
![Page 12: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/12.jpg)
July 2010 12
Vcc
Vss
faster=1
slower=0
Vcc
R
C
Vout
Faster Mode Vout → Vcc
![Page 13: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/13.jpg)
July 2010 13
Vcc
R
C
Vout
Vcc
Vss
faster=0
slower=1
Slower Mode Vout → Vss
![Page 14: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/14.jpg)
July 2010 14
Loop Filter
Vcc
R
C
Vctrl+
-
Vctrl
Differential amplifier connected as a unity-gain follower is used.
![Page 15: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/15.jpg)
July 2010 15
Voltage Controlled Oscillator (VCO)
A Ring Oscillator cascades an odd number of inverters and feeds back the
last output to first inverter (even number of inverters will be stable). It starts
to oscillate spontaneously.
Vout
inv
inv
If is the delay of an inverter and is the number of inverters,
oscillation frequency is 1 2
t n
f nt
Frequency can be controlled by number of inverters and supply voltage
of inverter (higher voltage obtains faster inverter).
![Page 16: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/16.jpg)
July 2010 16
Ring of 5 inverters
Buffering for driving clk_out
Components of VCO
Level Converter from Vctrl-Vss to Vcc-Vss
clk_out
Vcc
Vss
Vctrl
+
- Vcc
--
Vcc -
+
![Page 17: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/17.jpg)
July 2010 17
Delay Locked Loop (DLL)
• It is a variant of PLL that uses voltage-controlled delay
line rather than oscillator.
• It adjusts phase only. Frequency multiplication is
impossible.
• It is simpler than PLL, less sensitive to Vctrl noise and
requires simpler loop filter.
• It is very difficult to correctly design PLL and DLL. It
requires expertise in control systems and analog circuit
design.
![Page 18: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/18.jpg)
Voltage Controlled Delay Line
July 2010 18
How DLL Works?
clk_out
ref_clk
fdbk_clk
Phase Detect
Up
DownI
I
Charge Pump
C
R
Vctrl
Loop Filter
![Page 19: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/19.jpg)
July 2010 19
Clock Distribution Networks
![Page 20: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/20.jpg)
July 2010 20
Tree Clock Network (Unconstrained)
clk1
clk2
clk3
clkn
No constraints imposed on buffers and wires.
Used mostly by automatic tools in automatic synthesis flows.
Can be used for small blocks within large design.
Tools aim at minimizing the variance of clock delays.
![Page 21: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/21.jpg)
July 2010 21
1 1
2
If
1
is the clock delay at a leaf, the variance exapression
should be minimized. n n
i i
CLK
CLK CLK
i
i i
T
T Tn
Serpentine routing or extra buffers may be introduced
to obtain small variance.
Constraints on power can be imposed by limiting
number and size of clock buffers and width of wires.
![Page 22: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/22.jpg)
July 2010 22
Clock Distribution with Grids
Grid feeds flops directly, no local buffers
Clock driver tree spans height of chipInternal levels shorted together
Low skew but high power
![Page 23: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/23.jpg)
July 2010 23
Clock drivers are on perimeter Clock drivers are on grid points
![Page 24: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/24.jpg)
July 2010 24
Delay and Skew in Grid Distribution
![Page 25: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/25.jpg)
July 2010 25
DEC’s Alpha Microprocessor Clocking
![Page 26: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/26.jpg)
July 2010 26
DEC Alpha 21264 Microprocessor Clock distribution
![Page 27: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/27.jpg)
July 2010 27
Clock Distribution with Spines
![Page 28: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/28.jpg)
July 2010 28
Intel’s Pentium4 Clock Distribution
![Page 29: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/29.jpg)
July 2010 29
![Page 30: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/30.jpg)
July 2010 30
Clock Distribution with Trees
Recursive pattern to distribute signals uniformly with equal delay over area
H-Tree
Each branch is individually routed to balance RC delay
RC-Tree
More skew but less power
![Page 31: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/31.jpg)
July 2010 31
Clock H-Tree
clock / PLL
chip / functional block / IP
sequential elements
![Page 32: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/32.jpg)
July 2010 32
IBM / Motorola PowerPC Clock Distribution
![Page 33: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/33.jpg)
July 2010 33
Delay Calculation
We use Elmore delay model. Sub trees are modeled as capacitive loads
![Page 34: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/34.jpg)
July 2010 34
Clock Skew and Jitter
• Clock should theoretically arrive simultaneously to all
sequential circuits.
• Practically it arrives in different times. The differences
are called clock skews.
• Skews result from paths mismatches, process variations
and ambient conditions, resulting physical clocks.
• Most systems distribute a global clock and then use
local clock gaters located near clocked elements.
Clock skew consists of the following components:
![Page 35: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/35.jpg)
July 2010 35
• Systematic is the portion existing under nominal
conditions. It can be minimized by appropriate design.
• Random is caused by process variations like devices’
channel length, oxide thickness, threshold voltage, wire
thickness, width and space. It can be measured on
silicon and adjusted by delay components.
• Drift is caused by time-dependent environmental
variations, occurring relatively slowly. Compensation of
those must takes place periodically.
• Jitter is rapid clock changes, occurring by power noise
and clock generator jitter. It cannot be compensated.
![Page 36: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/36.jpg)
July 2010 36
Factors affecting clock skew, Intel 1998, 0.25u.
![Page 37: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/37.jpg)
July 2010 37
Skew, Clock Cycle and Design Margins
Clock Jitter is the same order as skew, but far more difficult to compensate.
![Page 38: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/38.jpg)
July 2010 38
Skew Modeling
D
Q
CL
1 2 m
1 2 n
Clock Generator
Tclk1
Tclk2
CLK1 1
mii
T T m
CLK2 1
nii
T T n
l d CC - capacitive load, - drive current, - voltage swingC I V
l CC dC V I
Point of divergence
![Page 39: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/39.jpg)
July 2010 39
CC l CClCC l d CC l d2
CC l d d d d
CC dl
CC l d
Consider a small change in the delay, taking the linear term.
V CVCV C I V C I
V C I I I I
V IC
V C I
Standard deviation of stage delay is = . is around 5%.
CLK1 CLK2
skew CLK2 CLK1 skew
If clock buffers delays are normally distributed and independent
of each other, then , and .
Skew is: | | .
T m T n
T T T m n
![Page 40: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/40.jpg)
July 2010 40
Consider - level clock tree.m
Clock Distribution Switching Power
l_Let be the total load of its far-end driven sequential
elements.
mC
Assuming a fixed fanout of each clock tree buffer, the
dynamic power is:
k
2CLK_ l_ CC ,m mP C V f
l_ 2
CCCLK_ , 0 1.mm j j
CP V f j m
k
![Page 41: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/41.jpg)
July 2010 41
Summing over whole clock tree:
1 2CLK CCl_0
1 l_2 2CC CC l_0
1 1
1 1
mm jj
mm m
mjj
P C V f
C kV f V fC
kk
CLK _
CLK
1 1.
1 1
mm
P k
P k
How much of the power is consumed by the far end drivers of clock tree?
![Page 42: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/42.jpg)
July 2010 42
Given the number of sequential elements in a block, at
least 50% of the switching power is consumed by the
far end drivers (clock tree is binary, k=2). This number
approaches 1 rapidly with k growth.
Example: Assume a block with 214 sequential elements
and H-tree clock distribution. Then k=4 and m=7. The
far end drivers consume nearly 75% of the clock tree
switching power, while adding the next upper level
drivers brings it to more than 90%.
![Page 43: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/43.jpg)
July 2010 43
Active Clock De-Skewing
• Compensates process variability, temperature gradients,
imperfect design.
• Can be implemented for global fixes (small HW
overhead) or local fixes (high HW overhead).
• Can be used at testing for one time fix (variability
occurring during manufacturing), or dynamically
concurrently with chip operation.
• Its implementation is a difficult design challenge.
![Page 44: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/44.jpg)
July 2010 44
Intel’s Pentium2 De-Skewing System
1998, 450MHz Clock
0.25u process
60pSec skew w/o fix
15pSec skew with fix
Two clock spines for two clock regions.
A phase detector detects relative shifts.
Clock of a region is shifted by a delay line.
![Page 45: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/45.jpg)
July 2010 45
Delay line consists of two cascaded inverters.
Each has a programmable load consists of eight parallel P-N gate
capacitors.
The shift register stores a thermometer code for load programming
in steps of 12pSec.
![Page 46: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/46.jpg)
July 2010 46
Another Programmable Delay Line
The signal from input to output is delayed
or not according to the control bit.
In
Out22n 1
0
12n 1
0
1 1
0
1nD 2nD 0D
In
Out1
0
1 2 0
Connecting a series of delay elements, each of delay 2 , 1 0,
it is possible to control the dely of the line to any value from 0 to 2 1
dealy units by an -bit control word , , , .
i
n
n n
n n i
n D D D
![Page 47: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/47.jpg)
July 2010 47
Intel’s IA64 Itanium1 De-Skewing System
2000, 800MHz clock
0.18u process
28pSec skew with fix
X4 increase w/o fix
30 independent de-skew regions. PLL
Each cluster is driven from a global H-tree.
Delay circuit in de-skew region are similar to Pentium3 with 20-bit registers.
![Page 48: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/48.jpg)
July 2010 48
![Page 49: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/49.jpg)
July 2010 49
Proposal for H-Tree Clock De-Skew – Hierarchical Approach
If a phase detector (PD) has a skew guard band g, then guard bands
may accumulate along tree paths.
For example, if a logic stage is shared between region B and C, it may
add 7g time units to path delay.
![Page 50: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/50.jpg)
July 2010 50
Proposal for H-Tree Clock De-Skew – Mesh Approach
Clock is distributed by H-
tree, but de-skew takes
place by neighbor leaves
phase detection.
A delay buffer accepts
phase inputs from its 4
neighbors and then
decides of whether to
increase, decrease or
not change its delay.
![Page 51: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/51.jpg)
July 2010 51
Clock Characteristics of Commercial Processors
![Page 52: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/52.jpg)
Power Consumption in Chips
• Clock power may reach 50% of total (dynamic + static).
• Clock gating is very useful and standard design practice
• Four gating methods:– Synthesis based, automated by EDA tools, RTL
compilers, inserted into clock-tree– Clock enable signals manually defined by designer,
inserted into clock-tree, FFs’ clock input– Data-Driven clock gating, inserted at FF-level– Auto-gated FF, inserted at latch-level
52
![Page 53: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/53.jpg)
FF Data Toggling (DSP core)
23k FFs, Test bench of 250k CLK cycles
53
10% of application run-time CLK is enabled.
Of which, only 1.6% CLK pulses are useful!
![Page 54: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/54.jpg)
dat
a ac
tivi
ty /
clo
ck a
ctiv
ity
cum
ula
tive
no
rmal
ized
clo
ck
cap
acit
ive
load
block count
b
a
0.03
FF Data Toggling 63 control blocks of ϻP, 200k FFs
![Page 55: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/55.jpg)
Data-Driven CLK Gating
55
![Page 56: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/56.jpg)
56
tight constraint
![Page 57: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/57.jpg)
Net saving per FF
Gater’s disabling probability
Latch overhead amortized over k FFs
Probability of enabling FF
How Many FFs To Group ?
k: # flip-flops, q: FF probability of D=Q q=1-p
Worst case : All FF are toggling independently of each other.
latch w OFFsaving
FF
w R1kc q c c k qc c c
Derivate by k: 2FF W latchln 0kq q c c c k
57
![Page 58: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/58.jpg)
58
![Page 59: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/59.jpg)
59
Optimal Flip-flop k-size Grouping
Given flip-flops and 1 clock cyclesn m
1, , is the activity (toggling) of flip-flopma aa
is the number of redundant clock pulses
ocurring by jointly clocking FF and FF
i j
i j
a a
FF1: 0 1 0 0 1 1 1 0 0 0 1 0 1 0 1 1 0
FF2: 0 1 1 0 1 0 1 0 1 0 1 1 0 0 1 1 1
![Page 60: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/60.jpg)
FF Pairwise Activity Model
60
, , : FF pairwise activity graph.G V E w
: vertex matchingE E
corresponds to FF . i iv V
, is FF pairing.ij i je v v E
| is joint toggling.i ja a
is redundant clock pulses, hence a waste.ij i jw e a a
![Page 61: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/61.jpg)
61
| |i ij ij
i i i j j i jv V e E e E a a a a a a a
2 |ij
i je EP a a
Total power:
i ij iji
i i jv V e E iv V ije Ew e
a aa aEssential + Waste
![Page 62: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/62.jpg)
FFi
FFj
FF FFi j
Flip-Flop Grouping Algorithm
62
Minimize: FF FFi j
Can be repeated for groups of size 4, 8, 16 …
Optimal FFs pairing
can be solved by
minimal cost perfect
graph matching.
![Page 63: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/63.jpg)
63
1a 3a 2a 7a 4a 6a 5a 8a
1 2|a a3 4|a a
7 8|a a
6 5|a a
Is repeated perfect matching optimal ?
![Page 64: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/64.jpg)
64
No! Here is the optimal 4-size grouping
![Page 65: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/65.jpg)
Master Latch
Slave Latch
1-bit flip-flop
CLK
CLKCLK
D Q
Master Latch
Slave Latch
CLK
CLKCLK
D Q
Master Latch
Slave Latch
CLK CLK
CLK
Master Latch
Slave Latch
2-bit flip-flop
Q1
Q2
D1
D2
+ =
Multi-Bit Flip-FlopSaving the power of internal CLK drivers
65
![Page 66: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/66.jpg)
66
![Page 67: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/67.jpg)
67
MBFF should be combined with data-driven CG to maximize energy savings.
Toggling vectors (VCD) are unfortunately not always available.
Data-to-clock toggling ratio (probability) is more often available.
How to utilize it for MBFF optimal grouping?
![Page 68: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/68.jpg)
68
1 1 2j i i j i j i jp p p p p p p p
What is the energy waste in 2-bit MBFF?
2
1 1
2nn
j s ti ij i
p p p
For n FFs grouped in n/2 MBFFs it is
What is the optimal MBFF grouping?
1 2 np p p Group the FFs such that
![Page 69: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/69.jpg)
Auto-Gated FF
69
![Page 70: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/70.jpg)
Look-Ahead CLK Gating
70
![Page 71: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/71.jpg)
71
Relaxed constraint
![Page 72: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/72.jpg)
Power Saving Per FF
72
FF+CLKX O A FF O
FF+CLK FF O
savedyn
int3
1k
cp c kc c c c
c
p c c c
![Page 73: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/73.jpg)
Which FF to Gate?
73
![Page 74: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/74.jpg)
Results
74
![Page 75: Clock Distribution](https://reader035.fdocuments.us/reader035/viewer/2022081519/56814634550346895db3410a/html5/thumbnails/75.jpg)
75