Clock Distribution
description
Transcript of Clock Distribution
July 2010 1
Clock Distribution
Shmuel WimerBar Ilan Univ. Eng. Faculty
Technion, EE Faculty
July 2010 2
Clock System Architecture
GatersBuffers
Clocked Elements
Clock Distribution
Clock Generator
Chip
clk1 clk2
clk3
External Clock
gclkext_clk
Clock generator adjusts the global clock to the external clock.
Global clock is distributed across the chip.
Chip receives external clock through I/O pad.
Local drivers and gaters drive the physical clocks to clocked elements.
July 2010 3
Global Clock Generation
• Receives external clock signal and produce the global
clock distributed across the die.
• A large skew occurs between external clock and the
physical clocks at clocked elements due to delay of
distribution network (wires, buffers, gaters).
• Therefore, data at clocked elements is no more in sync
with data at I/O pins.
• Phased Locked Loop (PLL) compensates this delay.
• PLL can perform frequency multiplication to obtain the
required on-chip frequencies.
July 2010 4
Din
Dout
CLKinext_clk
PLL
ref_clk
fdbk_clk
clk_out
Synchronous Chip Interface with PLL
Din
Dout
CLKout
Chip A Chip B
Chip A communicates synchronously with chip B
Chip B uses the clock sent by chip A. Data in and out must be synchronized to the common clock.
A PLL produces the global clock of chip B such that it is in sync with the external clock.
clkgclk
Clock Distribution
July 2010 5
How PLL Works?
Voltage Controlled Oscillator
clk_out
ref_clk
fdbk_clk
Phase Detect
Up
DownI
I
Charge Pump
C
R
Vctrl
Loop Filter
M
N
July 2010 6
Phase – Frequency Detector (PFD)
Once the lagging signal arrives, a reset turns both Q_A and Q_B to zero.
D Q
CLR
D Q
CLR
Q_A: B should go faster
A
1
1 Q_B: B should go slower
B
The two flip-flops receive the signals at their clock input (one is usually a
reference and the other is the sampled).
The output of the leading flip-flop is 1 for the lead duration.
July 2010 7
A: reference
B: sampled
Q_A: sampled should go faster
The spikes at Q_B are a result of the delay of the AND gate driving the
CLR input of flip-flip and the internal delay from CLR to Q.
What happens when the reference and the sampled signals are a shift of each other?
Q_B: sampled should go slower
July 2010 8
A: reference
Sampled is more often 1-value than the reference is, since rising edge of
B occurs more often than rising edge of A.
What happens when the reference and the sampled signals have different frequencies?
B: sampled
Q_B: sampled should go slower
Q_A: sampled should go faster
July 2010 9
Charge Pump
D Q
CLR
D Q
CLR
CLK_ref
1
1
CLK_fdbk
Vctrl
faster
slower
Icp
Icp
Sup
Sdn
Converts PFD error (digital) to charge (analog), which then controls PLL VCO.
cp up faster dn slowerCharge is proportional to PFD widths: .Q I t I t
July 2010 10
Charge pump consists of current mirrors which are sources of constant
current. Device N1 is in saturation since its gate is connected to high voltage.
Ids (=Iin) depends only on Vgs. Vgs is similar in N2, hence Iout=Iin.
This is an ideal current source with infinite output impedance since Iout is
independent of N2 load; a change in output voltage doesn’t affect Iout.
Iin IoutV
N1 N2
Vss
Iin IoutV
P1 P2
Vcc
Current mirror works similarly for P transistors.
Current Mirror
July 2010 11
Vss
Vcc
faster
slower
Vcc
R
C
Vout
Switch – open when faster = 1
Switch – open when slower = 1
How Charge Pump Works?
P-type current mirror
II
N-type current mirror
III
I
R load – determines the current through current mirror
July 2010 12
Vcc
Vss
faster=1
slower=0
Vcc
R
C
Vout
Faster Mode Vout → Vcc
July 2010 13
Vcc
R
C
Vout
Vcc
Vss
faster=0
slower=1
Slower Mode Vout → Vss
July 2010 14
Loop Filter
Vcc
R
C
Vctrl+
-
Vctrl
Differential amplifier connected as a unity-gain follower is used.
July 2010 15
Voltage Controlled Oscillator (VCO)
A Ring Oscillator cascades an odd number of inverters and feeds back the
last output to first inverter (even number of inverters will be stable). It starts
to oscillate spontaneously.
Vout
inv
inv
If is the delay of an inverter and is the number of inverters,
oscillation frequency is 1 2
t n
f nt
Frequency can be controlled by number of inverters and supply voltage
of inverter (higher voltage obtains faster inverter).
July 2010 16
Ring of 5 inverters
Buffering for driving clk_out
Components of VCO
Level Converter from Vctrl-Vss to Vcc-Vss
clk_out
Vcc
Vss
Vctrl
+
- Vcc
--
Vcc -
+
July 2010 17
Delay Locked Loop (DLL)
• It is a variant of PLL that uses voltage-controlled delay
line rather than oscillator.
• It adjusts phase only. Frequency multiplication is
impossible.
• It is simpler than PLL, less sensitive to Vctrl noise and
requires simpler loop filter.
• It is very difficult to correctly design PLL and DLL. It
requires expertise in control systems and analog circuit
design.
Voltage Controlled Delay Line
July 2010 18
How DLL Works?
clk_out
ref_clk
fdbk_clk
Phase Detect
Up
DownI
I
Charge Pump
C
R
Vctrl
Loop Filter
July 2010 19
Clock Distribution Networks
July 2010 20
Tree Clock Network (Unconstrained)
clk1
clk2
clk3
clkn
No constraints imposed on buffers and wires.
Used mostly by automatic tools in automatic synthesis flows.
Can be used for small blocks within large design.
Tools aim at minimizing the variance of clock delays.
July 2010 21
1 1
2
If
1
is the clock delay at a leaf, the variance exapression
should be minimized. n n
i i
CLK
CLK CLK
i
i i
T
T Tn
Serpentine routing or extra buffers may be introduced
to obtain small variance.
Constraints on power can be imposed by limiting
number and size of clock buffers and width of wires.
July 2010 22
Clock Distribution with Grids
Grid feeds flops directly, no local buffers
Clock driver tree spans height of chipInternal levels shorted together
Low skew but high power
July 2010 23
Clock drivers are on perimeter Clock drivers are on grid points
July 2010 24
Delay and Skew in Grid Distribution
July 2010 25
DEC’s Alpha Microprocessor Clocking
July 2010 26
DEC Alpha 21264 Microprocessor Clock distribution
July 2010 27
Clock Distribution with Spines
July 2010 28
Intel’s Pentium4 Clock Distribution
July 2010 29
July 2010 30
Clock Distribution with Trees
Recursive pattern to distribute signals uniformly with equal delay over area
H-Tree
Each branch is individually routed to balance RC delay
RC-Tree
More skew but less power
July 2010 31
Clock H-Tree
clock / PLL
chip / functional block / IP
sequential elements
July 2010 32
IBM / Motorola PowerPC Clock Distribution
July 2010 33
Delay Calculation
We use Elmore delay model. Sub trees are modeled as capacitive loads
July 2010 34
Clock Skew and Jitter
• Clock should theoretically arrive simultaneously to all
sequential circuits.
• Practically it arrives in different times. The differences
are called clock skews.
• Skews result from paths mismatches, process variations
and ambient conditions, resulting physical clocks.
• Most systems distribute a global clock and then use
local clock gaters located near clocked elements.
Clock skew consists of the following components:
July 2010 35
• Systematic is the portion existing under nominal
conditions. It can be minimized by appropriate design.
• Random is caused by process variations like devices’
channel length, oxide thickness, threshold voltage, wire
thickness, width and space. It can be measured on
silicon and adjusted by delay components.
• Drift is caused by time-dependent environmental
variations, occurring relatively slowly. Compensation of
those must takes place periodically.
• Jitter is rapid clock changes, occurring by power noise
and clock generator jitter. It cannot be compensated.
July 2010 36
Factors affecting clock skew, Intel 1998, 0.25u.
July 2010 37
Skew, Clock Cycle and Design Margins
Clock Jitter is the same order as skew, but far more difficult to compensate.
July 2010 38
Skew Modeling
D
Q
CL
1 2 m
1 2 n
Clock Generator
Tclk1
Tclk2
CLK1 1
mii
T T m
CLK2 1
nii
T T n
l d CC - capacitive load, - drive current, - voltage swingC I V
l CC dC V I
Point of divergence
July 2010 39
CC l CClCC l d CC l d2
CC l d d d d
CC dl
CC l d
Consider a small change in the delay, taking the linear term.
V CVCV C I V C I
V C I I I I
V IC
V C I
Standard deviation of stage delay is = . is around 5%.
CLK1 CLK2
skew CLK2 CLK1 skew
If clock buffers delays are normally distributed and independent
of each other, then , and .
Skew is: | | .
T m T n
T T T m n
July 2010 40
Consider - level clock tree.m
Clock Distribution Switching Power
l_Let be the total load of its far-end driven sequential
elements.
mC
Assuming a fixed fanout of each clock tree buffer, the
dynamic power is:
k
2CLK_ l_ CC ,m mP C V f
l_ 2
CCCLK_ , 0 1.mm j j
CP V f j m
k
July 2010 41
Summing over whole clock tree:
1 2CLK CCl_0
1 l_2 2CC CC l_0
1 1
1 1
mm jj
mm m
mjj
P C V f
C kV f V fC
kk
CLK _
CLK
1 1.
1 1
mm
P k
P k
How much of the power is consumed by the far end drivers of clock tree?
July 2010 42
Given the number of sequential elements in a block, at
least 50% of the switching power is consumed by the
far end drivers (clock tree is binary, k=2). This number
approaches 1 rapidly with k growth.
Example: Assume a block with 214 sequential elements
and H-tree clock distribution. Then k=4 and m=7. The
far end drivers consume nearly 75% of the clock tree
switching power, while adding the next upper level
drivers brings it to more than 90%.
July 2010 43
Active Clock De-Skewing
• Compensates process variability, temperature gradients,
imperfect design.
• Can be implemented for global fixes (small HW
overhead) or local fixes (high HW overhead).
• Can be used at testing for one time fix (variability
occurring during manufacturing), or dynamically
concurrently with chip operation.
• Its implementation is a difficult design challenge.
July 2010 44
Intel’s Pentium2 De-Skewing System
1998, 450MHz Clock
0.25u process
60pSec skew w/o fix
15pSec skew with fix
Two clock spines for two clock regions.
A phase detector detects relative shifts.
Clock of a region is shifted by a delay line.
July 2010 45
Delay line consists of two cascaded inverters.
Each has a programmable load consists of eight parallel P-N gate
capacitors.
The shift register stores a thermometer code for load programming
in steps of 12pSec.
July 2010 46
Another Programmable Delay Line
The signal from input to output is delayed
or not according to the control bit.
In
Out22n 1
0
12n 1
0
1 1
0
1nD 2nD 0D
In
Out1
0
1 2 0
Connecting a series of delay elements, each of delay 2 , 1 0,
it is possible to control the dely of the line to any value from 0 to 2 1
dealy units by an -bit control word , , , .
i
n
n n
n n i
n D D D
July 2010 47
Intel’s IA64 Itanium1 De-Skewing System
2000, 800MHz clock
0.18u process
28pSec skew with fix
X4 increase w/o fix
30 independent de-skew regions. PLL
Each cluster is driven from a global H-tree.
Delay circuit in de-skew region are similar to Pentium3 with 20-bit registers.
July 2010 48
July 2010 49
Proposal for H-Tree Clock De-Skew – Hierarchical Approach
If a phase detector (PD) has a skew guard band g, then guard bands
may accumulate along tree paths.
For example, if a logic stage is shared between region B and C, it may
add 7g time units to path delay.
July 2010 50
Proposal for H-Tree Clock De-Skew – Mesh Approach
Clock is distributed by H-
tree, but de-skew takes
place by neighbor leaves
phase detection.
A delay buffer accepts
phase inputs from its 4
neighbors and then
decides of whether to
increase, decrease or
not change its delay.
July 2010 51
Clock Characteristics of Commercial Processors
Power Consumption in Chips
• Clock power may reach 50% of total (dynamic + static).
• Clock gating is very useful and standard design practice
• Four gating methods:– Synthesis based, automated by EDA tools, RTL
compilers, inserted into clock-tree– Clock enable signals manually defined by designer,
inserted into clock-tree, FFs’ clock input– Data-Driven clock gating, inserted at FF-level– Auto-gated FF, inserted at latch-level
52
FF Data Toggling (DSP core)
23k FFs, Test bench of 250k CLK cycles
53
10% of application run-time CLK is enabled.
Of which, only 1.6% CLK pulses are useful!
dat
a ac
tivi
ty /
clo
ck a
ctiv
ity
cum
ula
tive
no
rmal
ized
clo
ck
cap
acit
ive
load
block count
b
a
0.03
FF Data Toggling 63 control blocks of ϻP, 200k FFs
Data-Driven CLK Gating
55
56
tight constraint
Net saving per FF
Gater’s disabling probability
Latch overhead amortized over k FFs
Probability of enabling FF
How Many FFs To Group ?
k: # flip-flops, q: FF probability of D=Q q=1-p
Worst case : All FF are toggling independently of each other.
latch w OFFsaving
FF
w R1kc q c c k qc c c
Derivate by k: 2FF W latchln 0kq q c c c k
57
58
59
Optimal Flip-flop k-size Grouping
Given flip-flops and 1 clock cyclesn m
1, , is the activity (toggling) of flip-flopma aa
is the number of redundant clock pulses
ocurring by jointly clocking FF and FF
i j
i j
a a
FF1: 0 1 0 0 1 1 1 0 0 0 1 0 1 0 1 1 0
FF2: 0 1 1 0 1 0 1 0 1 0 1 1 0 0 1 1 1
FF Pairwise Activity Model
60
, , : FF pairwise activity graph.G V E w
: vertex matchingE E
corresponds to FF . i iv V
, is FF pairing.ij i je v v E
| is joint toggling.i ja a
is redundant clock pulses, hence a waste.ij i jw e a a
61
| |i ij ij
i i i j j i jv V e E e E a a a a a a a
2 |ij
i je EP a a
Total power:
i ij iji
i i jv V e E iv V ije Ew e
a aa aEssential + Waste
FFi
FFj
FF FFi j
Flip-Flop Grouping Algorithm
62
Minimize: FF FFi j
Can be repeated for groups of size 4, 8, 16 …
Optimal FFs pairing
can be solved by
minimal cost perfect
graph matching.
63
1a 3a 2a 7a 4a 6a 5a 8a
1 2|a a3 4|a a
7 8|a a
6 5|a a
Is repeated perfect matching optimal ?
64
No! Here is the optimal 4-size grouping
Master Latch
Slave Latch
1-bit flip-flop
CLK
CLKCLK
D Q
Master Latch
Slave Latch
CLK
CLKCLK
D Q
Master Latch
Slave Latch
CLK CLK
CLK
Master Latch
Slave Latch
2-bit flip-flop
Q1
Q2
D1
D2
+ =
Multi-Bit Flip-FlopSaving the power of internal CLK drivers
65
66
67
MBFF should be combined with data-driven CG to maximize energy savings.
Toggling vectors (VCD) are unfortunately not always available.
Data-to-clock toggling ratio (probability) is more often available.
How to utilize it for MBFF optimal grouping?
68
1 1 2j i i j i j i jp p p p p p p p
What is the energy waste in 2-bit MBFF?
2
1 1
2nn
j s ti ij i
p p p
For n FFs grouped in n/2 MBFFs it is
What is the optimal MBFF grouping?
1 2 np p p Group the FFs such that
Auto-Gated FF
69
Look-Ahead CLK Gating
70
71
Relaxed constraint
Power Saving Per FF
72
FF+CLKX O A FF O
FF+CLK FF O
savedyn
int3
1k
cp c kc c c c
c
p c c c
Which FF to Gate?
73
Results
74
75