Conventional & Unconventional Applications of FPGA

Wu, Jinyuan

Fermilab

Oct, 2010

Fermi National Accelerator Laboratory

Colliding Experiments

Oct. 2010, Wu Jinyuan, Fermilab [email protected]

Introduction

There is no clear distinction between conventional and unconventional applications of FPGA. The range of FPGA applications is very likely broader than we can imagine.

The outline of this talk:

• The Starting Point
• Topics on Averages
• Using FPGA as ADC
• TDC Implemented with FPGA
• Serial Communication Between FPGA Devices
• Doublet Finding in Trigger System
• Triplet Finding in Trigger System

The Starting Point





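Care Must Be Taken Outside FPGA (1)

[Figure: signal chain Shaper/LP filter → ADC → FPGA. Panels: spectrum of original signal; band limiting by the LP filter; ADC input; sampling in the ADC; aliasing without LP filtering.]

The signal bandwidth at the ADC input must stay below the Nyquist frequency, i.e. (1/2) × sampling frequency.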



The “Trend” vs. The Sampling Theorem

The “trend”: there will be no hardware analog processing. Everything is done digitally in software.

It sounds very stylish, but the sampling theorem still applies.

A shaper/low-pass filter is a minimum requirement.
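Why the filter cannot be replaced by software can be seen from a small aliasing calculation. The sketch below is only an illustration; the function name `alias_frequency` and the example frequencies are assumptions, not taken from the talk.

```python
def alias_frequency(f_signal, f_sample):
    """Apparent frequency of a pure tone at f_signal after sampling at f_sample."""
    f = f_signal % f_sample          # fold into [0, f_sample)
    return min(f, f_sample - f)      # reflect into [0, f_sample / 2]

# Hypothetical 100 MHz sampling clock.
print(alias_frequency(30e6, 100e6))  # below Nyquist: stays at 30 MHz
print(alias_frequency(70e6, 100e6))  # above Nyquist: also shows up at 30 MHz
```

Once the two tones are sampled they are indistinguishable, so no digital processing inside the FPGA can separate them afterwards.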



Care Must Be Taken Outside FPGA (2)

[Figure: signal chain Shaper/LP filter → ADC → FPGA, with dither noise n added at the ADC input. Two plots of ADC counts (51 to 54) vs. sampling index (0 to 150): without dither (Signal, ADC(signal), Threshold) and with dither (Signal, Signal+Noise, ADC(signal+noise), Weighted Average, Threshold).]

Resolution finer than the ADC LSB can be achieved by adding noise at ADC input and digital filtering.
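A minimal simulation sketch of this effect, under stated assumptions: an ideal ADC with an LSB of 1, uniform dither of ±0.5 LSB, and a plain (unweighted) average as the digital filter. The function names and the signal value 52.3 are illustrative only.

```python
import math
import random

def adc(v):
    """Ideal ADC with an LSB of 1: round to the nearest code."""
    return math.floor(v + 0.5)

def measure(signal, n_samples, dither, rng):
    """Average n_samples ADC readings; dither=True adds +/-0.5 LSB uniform noise."""
    total = 0
    for _ in range(n_samples):
        noise = rng.uniform(-0.5, 0.5) if dither else 0.0
        total += adc(signal + noise)
    return total / n_samples

rng = random.Random(2010)
print(measure(52.3, 4096, False, rng))  # without dither: stuck on a single code
print(measure(52.3, 4096, True, rng))   # with dither: the average approaches 52.3
```

Without dither every sample returns the same code, so averaging cannot recover the fraction of an LSB; with dither the codes 52 and 53 appear in proportion to where the signal sits between them.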



Adding Noise for Finer Resolution

Photo Credit: www.telegraph.co.uk, trinities.org

Mechanical pressure gauges usually do not track small pressure changes well.

The gauge readers may lightly tap the gauges to get a more accurate reading.

The idea of dithering at ADC input is similar.



Why Are Band Limiting & Dithering Ignored?

Pre-amplifiers usually have a naturally limited bandwidth and an intrinsic noise larger than the LSB of the ADC.

So much of the time, band limiting and dithering can be “safely” ignored, since the requirements are satisfied automatically.

High-bandwidth, low-noise devices have now become easily accessible. A design can be too fast and too quiet.

Do not forget to review the band limiting and dithering requirements for each design.

Topics on Averages





From Sum to Average

When N = 2^k values are summed, the word length grows by k bits. These k bits are the fractional bits of the average.

[Figure: column addition of input words; the sum is split into an integer field and a fractional field.]
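The bit bookkeeping can be illustrated with a short sketch. The function name and the sample values are assumptions chosen for illustration; the point is that no division is needed, only a reinterpretation of the lowest k bits of the sum.

```python
def average_fixed_point(values, k):
    """Sum 2**k integer samples; the lowest k bits of the sum are the fraction."""
    assert len(values) == 1 << k
    total = sum(values)
    integer_part = total >> k               # upper bits: integer part of the average
    fraction_bits = total & ((1 << k) - 1)  # lower k bits: fractional part
    return total, integer_part, fraction_bits

total, integer_part, fraction_bits = average_fixed_point([52, 53, 52, 52], 2)
print(total, integer_part, fraction_bits)   # 209 52 1, i.e. 52 + 1/4 = 52.25
```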



Gain of Measurement Precision

When N independent measured values are averaged, precision improves by a factor of 1/sqrt(N).

If N = 2^k, the precision gain is k/2 bits (not k bits).

N      Word Length   Precision Gain
4      +2 bits       +1 bit
16     +4 bits       +2 bits
64     +6 bits       +3 bits
256    +8 bits       +4 bits

$\sigma_N = \sigma_1 / \sqrt{N}$



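Weighted Averages

The weighted average is a special case of the inner product. Multipliers are usually needed.

[Figure: seven input samples y1 … y7 multiplied by three coefficient sets (c1 … c7, d1 … d7, e1 … e7) and summed to form three weighted averages.]

$y_0 = \sum_i c_i y_i, \quad h = \sum_i d_i y_i, \quad \eta = \sum_i e_i y_i$

$\sum_i c_i = 1, \quad \sum_i d_i = 1, \quad \sum_i e_i = 1$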



Exponentially Weighted Average

No multipliers are needed. The average is available at any time.

It can be used to track the pedestal of the input signals.

$s[n] = s[n-1] + (x[n] - s[n-1])/N, \quad N = 2, 4, 8, 16, 32, \ldots$

[Figure: noise N(t), signal S(t), input Y = S + N and tracked pedestals Ped32, Ped64, Ped128 vs. sample index (0 to 500). Block diagram: x[n] minus s[n-1], scaled by 1/2^K, added to s[n-1] to give s[n].]
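A minimal sketch of the recursion s[n] = s[n-1] + (x[n] - s[n-1])/N with N = 2^k, written with shifts only to mirror a multiplier-free implementation. The function name, the number of fractional bits `frac_bits` and the test data are assumptions for illustration.

```python
def ewa_track(samples, k, frac_bits=8):
    """Exponentially weighted average with N = 2**k, using shifts only.

    The state s carries frac_bits extra fractional bits so that small
    differences are not lost when shifting right by k.
    """
    s = samples[0] << frac_bits
    out = []
    for x in samples:
        s += ((x << frac_bits) - s) >> k   # s[n] = s[n-1] + (x[n] - s[n-1]) / 2**k
        out.append(s >> frac_bits)         # drop the fractional bits for output
    return out

# A pedestal that steps from 0 to 64: the output follows with time constant ~2**k.
print(ewa_track([0] * 4 + [64] * 40, 3))
```

Because the right shift truncates, the tracked value may settle up to one count below a rising input; keeping more fractional bits reduces this residual.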



Sliding Sum/Sliding Average

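For each input point, a sliding sum is computed. It is preferable to implement the sliding sum in recursive fashion, since the recursive implementation uses far fewer resources.

Direct form: $s_n = \sum_{i=n-(K-1)}^{n} x_i$

Recursive form: $s_n = s_{n-1} + x_n - x_{n-K}$

[Figure: block diagrams of the direct form (x[n] → s[n]) and the recursive form (x[n] is added to, and x[n-K] subtracted from, the previous sum to give s[n]).]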
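The two forms can be compared in a short sketch. The function names and the sample data are illustrative assumptions; samples before the start of the record are taken as zero.

```python
from collections import deque

def sliding_sum_recursive(samples, K):
    """One add and one subtract per sample: s[n] = s[n-1] + x[n] - x[n-K]."""
    window = deque([0] * K, maxlen=K)   # delay line holding the last K inputs
    s = 0
    out = []
    for x in samples:
        s += x - window[0]              # window[0] is x[n-K]
        window.append(x)                # pushes x[n-K] out of the delay line
        out.append(s)
    return out

def sliding_sum_direct(samples, K):
    """K-1 additions per sample: sum the last K inputs from scratch."""
    return [sum(samples[max(0, n - K + 1): n + 1]) for n in range(len(samples))]

data = [3, 1, 4, 1, 5, 9, 2, 6]
print(sliding_sum_recursive(data, 3))   # [3, 4, 8, 6, 10, 15, 16, 17]
print(sliding_sum_direct(data, 3))      # same result
```

In hardware terms the recursive form needs one delay line, one adder and one subtractor regardless of K, while the direct form needs an adder tree that grows with K.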



Sliding Average as a Low-Pass filter

Sliding average removes high frequency random noise.

[Figure: recursive sliding-sum block diagram, s[n] = s[n-1] + x[n] - x[n-K].]

The CIC Filters

Sliding sum: $s[m] = \sum_{k=m-(K-1)}^{m} 1 \cdot x[k]$

Cascaded Integrator-Comb (CIC) sum of 2nd order: $y[n] = \sum_{k=n-(2K-1)}^{n} h[k] \cdot x[k]$

• The CIC-2 filter is a weighted average.

• Sliding Sum = CIC-1 sum.

• The frequency response of the CIC-2 sum is a sinc^2(x) function, which has 2nd-order zeros and better stop-band suppression.

[Figure: sinc(x) and sinc^2(x) vs. frequency x (0 to 20), with the zero marked (e.g. 360 Hz).]


The CIC-2 Sum as a Low-Pass Filter

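[Figure: waveforms filtered by the sliding sum and by the CIC-2 sum. Block diagrams of two equivalent implementations: two cascaded sliding sums (s[n] = s[n-1] + x[n] - x[n-K], then y[n] = y[n-1] + s[n] - s[n-K]), and a single stage (u[n] = u[n-1] + x[n] - 2x[n-K] + x[n-2K], then y[n] = y[n-1] + u[n]).]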

• No Multipliers are needed to implement the CIC-2 sum.
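A minimal sketch of the CIC-2 sum as two cascaded recursive sliding sums, using only additions and subtractions. The function names are illustrative assumptions; feeding in a unit impulse exposes the triangular weights that make the CIC-2 sum a weighted average.

```python
def sliding_sum(samples, K):
    """Recursive sliding sum: s[n] = s[n-1] + x[n] - x[n-K]."""
    out, s = [], 0
    for n, x in enumerate(samples):
        s += x - (samples[n - K] if n >= K else 0)
        out.append(s)
    return out

def cic2_sum(samples, K):
    """CIC-2 sum: a sliding sum of a sliding sum, with no multipliers."""
    return sliding_sum(sliding_sum(samples, K), K)

def triangular_weights(K):
    """Equivalent weights 1, 2, ..., K, ..., 2, 1 of the CIC-2 sum."""
    return list(range(1, K + 1)) + list(range(K - 1, 0, -1))

impulse = [1, 0, 0, 0, 0, 0, 0, 0]
print(cic2_sum(impulse, 3))        # [1, 2, 3, 2, 1, 0, 0, 0]
print(triangular_weights(3))       # [1, 2, 3, 2, 1]
```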

Using FPGA as ADC





The Single Slope ADC

[Figure: two multi-channel readout block diagrams. One uses Signal Source → Shaper → Line Driver → ADC → FPGA in each channel; the other uses Signal Source → Shaper → comparator against a common VREF → Line Driver → TDC block inside the FPGA in each channel.]

The analog signal of each channel from the shaper is fed to a comparator and compared with a common ramping reference voltage VREF.

Pulses, rather than analog signals, are transmitted on the cable.

The times of the transitions, which represent the input voltage values, are digitized by TDC blocks inside the FPGA.

This approach is sometimes (mistakenly) referred to as a “Wilkinson ADC”.

[Figure: ramp timing; input voltages V1, V2 map to transition times T1, T2.]



TDC Resolution Requirement

Consider a sampling rate of 2 MHz: the whole ramping cycle is 500 ns. Allocate 409.6 ns for upward ramping. To achieve 12-bit ADC precision, the TDC LSB must be (409.6 ns)/4096 = 100 ps. A TDC with a 100 ps LSB can be comfortably implemented in an FPGA today.

[Figure: ramp timing over one 500 ns cycle; input voltages V1, V2 map to transition times T1, T2.]
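The arithmetic of this requirement can be written as a small helper, which makes it easy to try other sampling rates or bit depths. The function name and argument names are assumptions for illustration.

```python
def single_slope_requirements(sample_rate_mhz, ramp_ps, adc_bits):
    """Return (ramping cycle in ns, required TDC LSB in ps)."""
    cycle_ns = 1e3 / sample_rate_mhz        # one full ramp cycle per sample
    tdc_lsb_ps = ramp_ps / (1 << adc_bits)  # upward ramp divided into 2**bits steps
    return cycle_ns, tdc_lsb_ps

# Numbers from the slide: 2 MHz sampling, 409.6 ns upward ramp, 12 bits.
print(single_slope_requirements(2, 409_600, 12))   # (500.0, 100.0)
```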



Digital Noise During Digitization

Typical ADC devices create noise that may interfere with the analog circuits.

The time interval for resetting the common reference voltage may be noisy, but the analog signal is not sampled during it.

There are no digital control activities while the common reference voltage is ramping up.

[Figure: Shaper and ADC; ramp timing with alternating noisy (reset) and clean (ramp-up) intervals; input voltages V1, V2 map to transition times T1, T2.]



Single Slope ADC Test: Waveform Digitization

[Figure: test circuit with two TDC inputs in the FPGA, components labeled 50, 50, 1000 pF and 100, and VREF. Plots: input waveform overlapped with trigger and reference voltage (1 to 2.5) vs. t (ns, 2500 to 5500); raw data and calibrated data (0 to 64 vs. 0 to 256) for the leading ramp and the trailing ramp.]

Shown here is a demo of a 6-bit single slope TDC.

Sampling rate in this test is 22 MHz.

Both leading and trailing reference ramps are used in this example.

Nonlinear reference ramping is OK. The measurement can be calibrated.

TDC Implemented with FPGA





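Multi-Sampling TDC

[Figure: block diagram inside the FPGA. The input is sampled with four clock phases c0, c90, c180, c270 (multiple sampling, Q0 to Q3), transferred into a single clock domain (clock domain changing, QD, QE, QF), then passed to transition detection and encoding; a coarse time counter runs in parallel. Outputs: DV, T0, T1, TS.]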

Ultra low-cost: 48 channels in $18.27 EP2C5Q208C7.

Sampling rate: 360 MHz x4 phases = 1.44 GHz.

LSB = 0.69 ns.

[Figure: placement of a 4-channel block in a Cyclone FPGA.]

Logic elements with non-critical timing are freely placed by the fitter of the compiler.



TDC Using FPGA Logic Chain Delay

This scheme uses current FPGA technology

Low cost chip family can be used. (e.g. EP2C8T144C6 $31.68)

Fine TDC precision can be implemented in slow devices (e.g., 20 ps in a 400 MHz chip).

[Figure: logic delay chain with IN and CLK inputs.]



Two Major Issues In a Free Operating FPGA

[Figure: bin width (ps, 0 to 180) vs. bin (0 to 64).]

1. The widths of the bins are different and vary with supply voltage and temperature.

2. Some bins are ultra-wide due to LAB boundary crossing.



Digital Calibration Using Twice-Recording Method

[Figure: longer delay chain with IN and CLK inputs.]

Use a longer delay line. Some signals may be registered twice, at two consecutive clock edges.

$N_2 - N_1 = (1/f) / \Delta t$

where 1/f is the clock period and Δt is the average bin width.

The two measurements can be used:
• to calibrate the delay;
• to reduce digitization errors.
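A minimal sketch of how a twice-recorded hit yields the average bin width, by inverting N2 - N1 = (1/f)/Δt. The function names and the example numbers (a 400 MHz clock and bins 5 and 55) are illustrative assumptions, and the conversion to time ignores any fixed offset of the chain.

```python
def average_bin_width_ps(n1, n2, clock_mhz):
    """Average bin width from one hit recorded at two consecutive clock edges."""
    period_ps = 1e6 / clock_mhz          # clock period 1/f in picoseconds
    return period_ps / (n2 - n1)         # one period spans (n2 - n1) bins

def fine_time_ps(n1, n2, clock_mhz):
    """Fine time of the first recording, using the average bin width only."""
    return n1 * average_bin_width_ps(n1, n2, clock_mhz)

print(average_bin_width_ps(5, 55, 400))  # 2500 ps / 50 bins = 50.0 ps
print(fine_time_ps(5, 55, 400))          # 250.0 ps
```

Since the width is re-measured on every twice-recorded hit, drifts with supply voltage and temperature are followed automatically, though only on average and not bin by bin.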



Digital Calibration Result

[Figure: TDC output at different power supply voltages: TDC outputs N1 and N2 (0 to 25) and corrected time Tc vs. VCCINT (1.5 V to 2.5 V).]

The power supply voltage changes from 2.5 V to 1.8 V (about the same effect as going from 100 °C to 0 °C).

The delay speed changes by 30%.

The difference of the two TDC numbers reflects the delay speed.

[Equation: corrected time Tc expressed through N1, N2, N0, L and T.]

Warning: the calibration is based on average bin width, not bin-by-bin widths.



Auto Calibration Using Histogram Method

• It provides a bin-by-bin calibration at a certain temperature.
• It is a turn-key solution (bin in, ps out).
• It is semi-continuous (the LUT is updated automatically every 16K events).

[Figure: DNL histogram (counts, 0 to 2500, vs. bin, 0 to 64) accumulated over 16K events and converted into a LUT that maps In (bin) to Out (ps); bin widths (ps, 0 to 180) vs. bin (0 to 64).]
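A minimal sketch of a histogram-based (code density) calibration, under the assumption that hits arrive uniformly distributed over the clock period, so that each bin collects counts in proportion to its width. The function name, the choice of the bin center as the LUT output, and the toy numbers are assumptions for illustration; the talk does not specify these details.

```python
def build_lut(counts, clock_period_ps):
    """Convert a histogram of raw TDC bins into a bin -> time (ps) lookup table.

    Each bin's width is its share of the counts times the clock period;
    the LUT entry is the center of the bin.
    """
    total = sum(counts)
    lut, edge = [], 0.0
    for c in counts:
        width = c * clock_period_ps / total
        lut.append(edge + width / 2)     # bin in, ps out
        edge += width                    # lower edge of the next bin
    return lut

# Toy example: three bins with unequal widths inside an 800 ps period.
print(build_lut([1, 3, 4], 800))         # [50.0, 250.0, 600.0]
```

Rebuilding the table after every fixed number of events, as in the 16K-event update above, lets the calibration follow slow changes in bin widths.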



Good, However

Auto calibration solves some problems. However, it won’t eliminate the ultra-wide bins.

[Figure: bin width (ps, 0 to 180) vs. bin (0 to 64).]



Cell Delay-Based TDC + Wave Union Launcher

[Figure: wave union launcher between the input In and the delay chain, registered by CLK.]

The wave union launcher creates multiple logic transitions after receiving an input logic step.

The wave union launchers can be classified into two types:

• Finite Step Response (FSR)
• Infinite Step Response (ISR)

This is similar to the classification of filters or other linear systems:

• Finite Impulse Response (FIR)
• Infinite Impulse Response (IIR)



Wave Union Launcher A (FSR Type)

[Figure: Wave Union Launcher A feeding the delay chain, with In and CLK inputs; 1: unleash, 0: hold.]



Wave Union Launcher A: 2 Measurements/hit

[Figure: wave union pattern in the delay chain after the input goes to 1 (unleash).]



Sub-dividing Ultra-wide Bins

[Figure: wave union edges 1 and 2 in the delay chain after the input goes to 1 (unleash).]

Device: EP2C8T144C6

Plain TDC:
• Max. bin width: 160 ps.
• Average bin width: 60 ps.

Wave Union TDC A:
• Max. bin width: 65 ps.
• Average bin width: 30 ps.

[Figure: bin width (ps, 0 to 180) vs. bin (0 to 128) for the Plain TDC and the Wave Union TDC A.]



FPGA TDC

• A possible choice of TDC is a delay-line-based architecture called the Wave Union TDC, implemented in FPGA.
• Shown here is an ASIC-like implementation in a 144-pin device.
• 18 channels (16 regular channels + 2 timing reference channels).
• This FPGA costs $28, i.e. $1.75/channel. (AD9222: $5.06/channel)
• LSB ~ 60 ps.
• RMS resolution < 25 ps.
• Power consumption 1.3 W, or 81 mW/channel. (AD9222: 90 mW/channel)

[Figure: Wave Union Launcher A with In and CLK inputs.]



More Measurements

Two measurements are better than one. Let’s try 16 measurements?



Wave Union Launcher B (ISR Type)

[Figure: Wave Union Launcher B feeding the delay chain, with In and CLK inputs; 1: oscillate, 0: hold.]



Wave Union Launcher B: 16 Measurements/hit

[Figure: 1 hit, 16 measurements at 400 MHz, shown for VCCINT = 1.20 V and VCCINT = 1.18 V.]



Delay Correction

[Figure: raw hits (bins 16 to 64) vs. measurement index m (0 to 16), and calibrated times (ps, 0 to 3000) vs. m (0 to 16).]

Delay Correction Process:
• Raw hits TN(m) in bins are first calibrated into TM(m) in picoseconds.
• Jumps are compensated for in the FPGA so that TM(m) become T0(m), which have the same value for each hit.
• Take the average of T0(m) to get better resolution.

The raw data contains:
• U-Type Jumps: [48-63] → [16-31]
• V-Type Jumps: other small jumps.
• W-Type Jumps: [16-31] → [48-63]

$T_{0av} = \frac{1}{16} \sum_{m=0}^{15} T_0(m)$

The processes are all done in FPGA.
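The last two steps of the correction process can be sketched as follows. This is only an illustration under assumptions: the jump compensation is modeled as subtracting a known per-measurement offset (the hypothetical list `offsets_ps`), and four measurements are used instead of 16 to keep the example short.

```python
def average_hit_time(tm_ps, offsets_ps):
    """Compensate each calibrated measurement TM(m) and average the results.

    offsets_ps[m] is the (assumed known) delay of measurement m relative to
    measurement 0, so that all T0(m) estimate the same hit time.
    """
    t0 = [t - d for t, d in zip(tm_ps, offsets_ps)]   # TM(m) -> T0(m)
    return sum(t0) / len(t0)                          # T0av

tm = [1003, 1148, 1302, 1449]          # calibrated measurements of one hit, in ps
offsets = [0, 150, 300, 450]           # hypothetical compensation values, in ps
print(average_hit_time(tm, offsets))   # 1000.5
```

Averaging M independent estimates of the same hit time is what improves the resolution, in the same way as averaging ADC samples.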



The Test Module

Two NIM inputs

FPGA with 8ch TDC

Data Output via Ethernet

BNC adapter to add delay in 150 ps steps.



Test Result

[Figure: test setup. NIM inputs → LeCroy 429A NIM fan-out → two NIM/LVDS converters → Wave Union TDC B channels; BNC adapters add delays in 140 ps steps. Time-difference histograms for 0, 1 and 2 adapters: peaks separated by 140 ps, RMS 10 ps.]

Wave Union?

Photograph: Qi Ji, 2010


Coincidence in Trigger System



Parameters in Coincidence Finding

[Figure: coincidence chains built from Disc, Edge Detecting, Delay, Pulse Stretch and Sampling blocks feeding the Coincidence Logic.]

Some Details

[Figure: Disc → Edge Detecting → Delay → Pulse Stretch → Sampling.]

Doublet Finding in Trigger System



Hit Matching

Software                                       FPGA (typical)                        FPGA (resource-saving approaches)
O(n^2): for(){ for(){…} }                      O(n)*O(N): comparator array           Hash Sorter, O(n)*O(N) in RAM
O(n^3): for(){ for(){ for(){…} } }             O(n)*O(N^2): CAM, Hough transform     Tiny Triplet Finder, O(n)*O(N*logN)
O(n^4): for(){ for(){ for(){ for(){…} } } }

Example of Doublet Match, PET

Positrons and electrons annihilate to produce pairs of photons. The back-to-back photons hit the detector at nearly the same time.

Detector hits are digitized and hits at nearly the same time are to be matched together.

The process takes O(n^2) clock cycles.

[Figure: hits (time T, data D) from Group 1 and Group 2 are compared; the time difference is tested against a window: T < A? T > (-A)?]

Hash Sorter

[Figure: Group 1 and Group 2 data D with key numbers K passing through the hash sorter.]

Pass 1: Data in Group 1 are stored in the hash sorter bins based on key number K.

Pass 2: Data in Group 2 are fetched through and paired up with the corresponding Group 1 data with the same key number K.

The entire pairing process takes 2n clock cycles, rather than n^2 clock cycles.
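The two passes can be sketched in software with a dictionary of lists standing in for the hash sorter bins. The function name and the sample keys are illustrative assumptions; in the PET example the key K would be derived from the hit time.

```python
from collections import defaultdict

def hash_sort_pairs(group1, group2):
    """Pair (key, data) items from two groups that share the same key."""
    bins = defaultdict(list)
    for key, data in group1:                 # Pass 1: store Group 1 by key
        bins[key].append(data)
    pairs = []
    for key, data in group2:                 # Pass 2: look up each Group 2 item
        for partner in bins.get(key, []):
            pairs.append((key, partner, data))
    return pairs

group1 = [(5, "a"), (9, "b"), (2, "c")]
group2 = [(9, "x"), (5, "y"), (7, "z")]
print(hash_sort_pairs(group1, group2))       # [(9, 'b', 'x'), (5, 'a', 'y')]
```

Each item is touched once per pass, which is where the 2n figure comes from; the nested-loop alternative compares every Group 1 item with every Group 2 item.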



Conclusion

Many things can be done in FPGA beyond our imagination.

The End

Thanks

Delay Line Based TDC Architectures

[Figure: four delay-line TDC architectures with HIT and CLK inputs, labeled Delay Hit, Delay CLK and Delay Both; in some CLK is used as the clock, in others HIT is used as the clock.]

Only this architecture needs dual coarse time counters.

Digital Phase Follower

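[Figure: block diagram. The input In is sampled with clock phases c0, c90, c180, c270 (multiple sampling, Q0 to Q3), transferred into a single clock domain (clock domain changing, QD, QE, QF), followed by transition detection, phase selection (SEL, with was3is0 and was0is3 conditions driving Shift2 and Shift0), a tri-speed shift register and frame detection producing Data Out (b0, b1).]

The input data rate is 1 bit/clock cycle. Four clock phases, c0, c90, c180 and c270, are used to detect the input transition edge. The phase used for data sampling follows the variation of the transition edge.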

Schematics of Digital Phase Follower

[Figure: schematics of the digital phase follower. Blocks phtrk1, DS5B and Word24_13z; inputs IN1, CLK0, CLK90, CLK180, CLK270; signals EE[3..0], QQ[11..0], BB, BX, JMP, EN, C0, C1; built from DFF and DFFE registers, AND/OR/XOR/NOT gates, an lpm counter and an lpm_mux.]

CLK: 375 MHz. Data rate: 375 Mbits/s.

The Two-Cycle Serial IO

This scheme is slower than digital phase follower but the logic is simpler. The CLK1 and CLK2 can be generated with two free running crystal oscillators.

[Figure: timing diagram. Transmitter: CLK1 and Data Out (start bit = 1, b15, b14, …). Receiver: CLK2 and Data In (start bit = 1, X, b15, X, b14, …).]

One data bit is transmitted every 2 clock cycles.

A logic transition is detected between these two falling edges.

Input data are stable at these clock edges.

Schematics of the Two-Cycle Serial IO

[Figure: schematics and timing diagram of the two-cycle serial IO. Inputs CK200, CK100, DD[15..0], DRDY, SDIN, DV; outputs QQ[15..0], SDOUT, POPCMD, QQOK; internal signals SEQ[5..0], SDINQ, ENA1; built from DFF and DFFE registers, lpm counters and lpm shift registers.]

CLK: 200 MHz. Data rate: 100 Mbits/s.

The FM coding

A bit is transmitted in two unit time intervals, usually in two internal clock cycles at frequency f.

For bit=1, the output toggles each cycle, i.e., with frequency (f/2) and for bit=0, the output toggles every two cycles, i.e., with frequency (f/4).

When not transmitting data, the output toggles at frequency (f/4), until seeing the start bit. The data stream is naturally DC balanced suitable for AC coupled transmission. The polarity of the interconnection doesn’t matter.

[Figure: example FM waveform: idle (0), start bit = 1, then data bits 0 0 1 1.]
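A minimal sketch of the FM coding rule described above, with two output samples per bit. The function names are illustrative assumptions, and the decoder assumes the bit-cell boundaries are already known (in the real design they would be found from the start bit).

```python
def fm_encode(bits, level=0):
    """Encode bits as line levels, two unit intervals per bit.

    Every bit cell starts with a transition; a 1 adds a second transition
    in the middle of the cell, a 0 does not.
    """
    out = []
    for b in bits:
        level ^= 1            # transition at the start of every cell
        out.append(level)
        if b:
            level ^= 1        # extra mid-cell transition for a 1
        out.append(level)
    return out

def fm_decode(levels):
    """Recover bits: a cell whose two halves differ carries a 1."""
    return [int(levels[i] != levels[i + 1]) for i in range(0, len(levels), 2)]

bits = [1, 0, 0, 1, 1]                      # start bit followed by data 0 0 1 1
line = fm_encode(bits)
print(line)                                 # [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
print(fm_decode(line))                      # [1, 0, 0, 1, 1]
print(fm_decode([1 - v for v in line]))     # inverted polarity decodes the same
```

Because only transitions carry information, swapping the polarity of the interconnection does not change the decoded data.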

Schematics of FM Decoder

[Figure: schematics and timing diagram of the FM decoder. Inputs CK212 and INA; outputs DV, DQ[17..0], PQ; internal signals INAQ, INATOG, TOGCNT[3..0], SSETFCNT, BTK[2..0], CNTSHFT, OKSample, BitCNT[4..0]; built from DFF and DFFE registers, an XOR gate, lpm counters and lpm decoders.]

Logic 0: INA at 13.25 MHz, i.e. one toggle every 8 CK212 cycles.

BitCNT: 13..31, initialized to 13 × 8 + 256 = 360.

CLK: 212 MHz. Data rate: 26.5 Mbits/s. The ratio of 8 CLK cycles/bit in this design is not an intrinsic limit.

The Clock-Command Combined Carrier Coding (C5)

A data train contains 5 pulses, and each pulse is transmitted in four unit time intervals, usually four internal clock cycles at frequency f.

Information is carried with wide, normal and narrow pulses, and the first pulse is always wide or narrow.

When not transmitting data, all pulses have normal width. The data stream is DC balanced over 5 pulses, which makes it suitable for AC-coupled transmission. All leading edges are evenly spaced, so the pulse train can be used directly to drive the receiver-side logic or PLL.

Schematics of C5 Decoder

[Figure: schematics of the C5 decoder. Input CC; a PLL (altpll, Cyclone, inclk0 period 36.000 ns, normal operation mode; c0 ratio 4/1 and e0 ratio 1/1, both with 0° phase, 0 ns delay and 50% duty cycle) producing C40; a Delay block producing T38 and T58; a Composer block producing Y[0..4], CmdValid and CmdBit[3..0]; built from DFF registers, NAND2/AND gates, an lpm_dff and a modulus-5 lpm counter.]

Data rate: 36 ns/bit, or 27.7 Mbits/s.

Internal clock: 111 MHz.



Measurement Result for Wave Union TDC A

[Figure: test setup with two TDC + LUT channels, a 53 MHz source and a separate crystal; time differences of raw and wave union data are histogrammed. Histogram of dt (ps, 1000 to 1500; counts 0 to 3500) for Un-calibrated, Plain TDC and Wave Union TDC A.]

Plain TDC: delta t RMS width 40 ps; 25 ps single hit.

Wave Union TDC A: delta t RMS width 25 ps; 17 ps single hit.

An Application in Liquid Argon Time Projection Chamber





Liquid Argon Time Projection Chamber

Passing charged particles ionize argon. Electric fields drift the electrons over meters to the wire chamber planes. Waveforms of all wires are digitized, which creates a large amount of data.

[Figure: data from the BO detector of FNAL; drift time vs. wire number for the Induction #1, Induction #2 and Collection planes.]



Data Reduction on Liquid Argon TPC Data

Hit waveforms in the TPC carry useful information. Digitizing the waveforms creates a large volume of data. Data reduction without losing useful information is necessary.

[Figure: data from the BO detector of FNAL; drift time vs. wire number, and a digitized waveform (0 to 700 vs. sample 0 to 2000).]

Serial Communication between FPGA Devices



Classical Picture of Serial Communications

The parallel data is converted to serial bits driven by crystal oscillator X1 in the transmitter device.

The serial data stream is used to generate a recovered clock at the receiver device with a phase lock loop (PLL).

The recovered clock is used to drive the serial-to-parallel converter and store the data into a first-in-first-out (FIFO) buffer.

The FIFO buffer is used to transfer data from the recovered clock domain to the local clock domain generated by crystal oscillator X2.

[Figure: transmitter (parallel-to-serial converter driven by X1) → receiver (PLL producing the recovered clock, serial-to-parallel converter, FIFO) → local logic driven by X2.]

Serial Data Receiving Without PLL etc.

Generating recovered clock with PLL, VCO, VCXO etc. is an analog process and it is not convenient to generate in an FPGA, especially for applications with multiple receiving channels.

There are purely digital methods to receive the serial data:

• Digital Phase Follower: 1 bit/CLK
• The Two-Cycle Serial IO: 1 bit/(2 CLK)
• FM Encoder and Decoder: 1 bit/(2-16 CLK)
• Clock-Command Combined Carrier Coding (C5): 4 bits/(20 CLK)

The transmitter and receiver can be driven by two independent free running crystal oscillators.

[Figure: transmitter (parallel-to-serial converter driven by X1) → receiver (digital serial-to-parallel converter) → local logic driven by X2.]

See Backup Slides

Triplet Finding in Trigger System



Hits, Hit Data & Triplets

• Hit data come out of the detector planes in random order.

• Hit data from 3 planes generated by the same particle tracks are organized together to form triplets.

• Three data items must satisfy the condition: xA + xC = 2 xB.

• A total of n^3 combinations must be checked (e.g. 5x5x5 = 125).

• Three layers of loops are needed if the process is implemented in software.

• Large silicon resources may be needed without careful planning: O(N^2).

[Figure: triplet finding across Plane A, Plane B and Plane C.]

Tiny Triplet Finder Operations, Pass I: Filling Bit Arrays

[Figure: physical planes and the corresponding bit arrays/shifters. Note: flipped bit order.]

For any hit, fill a corresponding logic cell.

• xA + xC = 2 xB

• xA = - xC + constant

Tiny Triplet Finder Operations, Pass II: Making Match

For any center plane hit:
• Logically shift the bit array.
• Perform bit-wise AND in this range.
• A triplet is found.

[Figure: physical planes and bit arrays/shifters.]
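The two passes can be modeled in software with Python integers standing in for the bit arrays. This is a sketch under assumptions, not the FPGA design: the function name, the bin count `n_bins` and the sample hits are hypothetical, and a real implementation would use a hardware shifter and a decoder instead of the bit-scanning loop.

```python
def tiny_triplet_finder(hits_a, hits_b, hits_c, n_bins):
    """Find (xA, xB, xC) with xA + xC = 2*xB using shifted bit arrays."""
    mask = (1 << n_bins) - 1
    array_a = 0
    array_c_flipped = 0
    # Pass I: fill the bit arrays; plane C is stored in flipped bit order.
    for x in hits_a:
        array_a |= 1 << x
    for x in hits_c:
        array_c_flipped |= 1 << (n_bins - 1 - x)
    triplets = []
    # Pass II: for each center-plane hit, shift and AND.
    for xb in hits_b:
        # Flipped bit j holds xC = n_bins-1-j; the matching xA is j + shift.
        shift = 2 * xb - (n_bins - 1)
        if shift >= 0:
            shifted = array_c_flipped << shift
        else:
            shifted = array_c_flipped >> -shift
        match = array_a & shifted & mask       # bit-wise coincidence
        xa = 0
        while match:                           # decode the set bits
            if match & 1:
                triplets.append((xa, xb, 2 * xb - xa))
            match >>= 1
            xa += 1
    return triplets

print(tiny_triplet_finder([2, 10], [5, 8], [8, 6, 3], 16))
# [(2, 5, 8), (10, 8, 6)]
```

The flipped bit order is what turns the condition xA = -xC + constant into a plain shift, so one set of AND gates serves every center-plane hit.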

Tiny? Yes, Tiny! – Logic Cell Usage:

• AM, CAM, Hough Transform etc.: O(N^2)
• Tiny Triplet Finder: O(N*logN)

The triplet finding process for FPGA schemes takes 2n clock cycles. The Tiny Triplet Finder uses far fewer logic elements.

Tiny Triplet Finder: Reuse Coincident Logic via Shifting Hit Patterns

[Figure: three detector layers C1, C2 and C3.]

One set of coincident logic is implemented.

For an arbitrary hit on C3, rotate, i.e., shift the hit patterns for C1 and C2 to search for coincidence.

Tiny Triplet Finder for Circular Tracks

[Figure: two bit arrays, each with a shifter whose shift amount is scaled by *R1/R3 or *R2/R3, feed the bit-wise coincident logic; the triplet map is output to a decoder. Plot with both axes from 0 to 128.]

1. Fill the C1 and C2 bit arrays. (n1 clock cycles)

2. Loop over C3 hits, shift bit arrays and check for coincidence. (n3 clock cycles)

Also works with more than 3 layers.

Linked List Structure of Hash Sorter

[Figure: hash sorter built from an Index RAM, a Pointer RAM and a DATA RAM; input DIN with key K, output DOUT.]

Using the hash sorter, matching pairs can be grouped together using 2n, rather than n^2, clock cycles.
