Xilinx DSP 1 Xilinx Core Solutions Group DSP

Page 1

Xilinx DSP 1

Xilinx Core Solutions Group

DSP

Page 2

Traditional DSP: DSP Processors

[Diagram: Multiply feeding Add. Single MAC, Sequential Processing]

– One MAC (Multiply Accumulate)
– Time-Shared
– Performance ceiling

+ Programmable
+ Off-the-shelf, standard part
+ Hardware multiplier

Page 3

Xilinx DSP: High Performance Alternative - Parallel Processing

[Diagram: many Multiply and Add stages side by side. Multiple MACs, Parallel Processing]

+ Programmable
+ Off-the-shelf, standard part
+ Many Multiplies in one clock cycle!
+ Extend the performance of DSP Processors
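The contrast between one time-shared MAC and many parallel MACs can be put into numbers. The following is a minimal sketch, not taken from the slides: the function names and the 100 MHz clock in the usage note are hypothetical, and the model simply assumes that every MAC unit completes one multiply-accumulate per clock.

```python
def cycles_per_output(taps: int, mac_units: int) -> int:
    """Clock cycles needed for one filter output when `taps`
    multiply-accumulates are shared across `mac_units` MAC units."""
    if taps <= 0 or mac_units <= 0:
        raise ValueError("taps and mac_units must be positive")
    return -(-taps // mac_units)  # ceiling division


def max_sample_rate_hz(clock_hz: float, taps: int, mac_units: int) -> float:
    """Upper bound on the sample rate: one output every
    cycles_per_output(...) clocks (idealized, no overhead)."""
    return clock_hz / cycles_per_output(taps, mac_units)
```

With a hypothetical 100 MHz clock and a 32-tap filter, `max_sample_rate_hz(100e6, 32, 1)` gives 3.125 million samples per second for the single time-shared MAC, while `max_sample_rate_hz(100e6, 32, 32)` gives the full 100 million because all taps are computed in the same clock cycle.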

Page 4

Xilinx DSP Solution

• CORE Generator

System-Level Tools

• DSP LogiCOREs

• Tools Integration

Page 5

Existing Xilinx DSP Design Methodology

[Diagram: CORE Generator, M1, XC4000X/Spartan/Virtex]

CORE Generator

Parameterize DSP LogiCOREs

Connect the cores with HDL or schematic

Page 6

Addition of DSP System Level Tool

DSP System level tools
– Used by all DSP systems engineers
– 100,000 copy installed base

Fit into existing DSP environment

Connect through the CORE Generator SystemLINX interface

[Diagram: System Level Tools, CORE Generator, M1]

Page 7

Performance

XC4085XL > 10x Faster than 320C6x

[Chart: 16-bit FIR Filter Benchmark. Vertical axis: Billions of MACs per Second, 1 to 5. Bars: 320C6x, 4005XL, 4013XL, 4036XL, 4062XL, 4085XL]

Page 8

120 Million Samples per Second 512-Tap Decimating FIR

[Diagram: registered 10-bit inputs feed two banks of 32-Tap FIR cores, numbered 1, 2, ... 8; an adder tree and output register deliver 18 bits]

• 3.8 Billion MACs (> 10 DSP µPs)
• 5,120 Flip-Flops just for data buffer
• XC4085XL, 150,000 Gates
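The figures on this slide can be cross-checked with a little arithmetic. The sketch below rests on two assumptions that are inferred from the diagram rather than stated: the 512 taps are split into 16 sub-filters of 32 taps (two banks numbered 1 to 8), and each input sample is processed by exactly one 32-tap sub-filter, as in a polyphase decimator. The constant names are hypothetical.

```python
SUBFILTERS = 16            # assumption: two banks of 8 cores, as drawn
TAPS_PER_SUBFILTER = 32    # "32-Tap FIR" blocks on the slide
DATA_BITS = 10             # "10 bits" input width on the slide
INPUT_RATE_HZ = 120e6      # "120 Million Samples per Second"


def total_taps() -> int:
    """Overall filter length across all sub-filters."""
    return SUBFILTERS * TAPS_PER_SUBFILTER


def buffer_flip_flops() -> int:
    """Storage bits needed to hold one data word per tap."""
    return total_taps() * DATA_BITS


def macs_per_second() -> float:
    """Assumes each input sample costs one pass through a 32-tap sub-filter."""
    return INPUT_RATE_HZ * TAPS_PER_SUBFILTER
```

Under these assumptions `total_taps()` is 512, `buffer_flip_flops()` is 5,120 and `macs_per_second()` is 3.84 billion, which agrees with the slide's rounded 3.8 billion.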

Page 9

Price

[Chart: Price per Million MACs per Second, vertical axis from $0.05 to $0.25. Bars: Lowest Cost C6x, Xilinx XC4000XL]

Page 10

DSP LogiCOREs Exploit FPGA Architecture

[Diagram: 16-word RAM and flip-flop (F/F)]

Matrix of 16 by 1 RAM primitives
– Look-up-table logic
– FIFOs, shift-registers, …
– Multiple small memories

10,000 RAM primitives on a chip
Regular, monolithic, scalable structure
Efficient: 1 - 3 Million MACs per CLB

Page 11

Distributed RAM & Distributed Arithmetic (DA): Perfect Match

[Diagram: N-bit inputs address 4-Input LUTs whose outputs feed an adder or accumulator (ADD or ACC.). Basic DA Structure Matches XC4000 Architecture]

DA Algorithms:

• 4-Input Look-Up-Tables (LUT) scaled with adders
• For higher performance, use more LUTs = more parallelism
• Efficiency similar to custom solution
  – Achievable with LUT logic
  – More ASIC gate equivalents
  – More cost effective
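To make the look-up-table idea concrete, here is a minimal software sketch of bit-serial distributed arithmetic for a small group of taps. It is an illustration under stated assumptions, not the LogiCORE implementation: the function names are hypothetical, samples are two's complement integers, and a single table covers all taps in the group (four taps give the 16-entry table that matches a 4-input LUT).

```python
from typing import List, Sequence


def build_da_lut(coeffs: Sequence[int]) -> List[int]:
    """Entry `addr` holds the sum of the coefficients whose bit is set in addr."""
    k = len(coeffs)
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << k)]


def da_dot(lut: Sequence[int], samples: Sequence[int], bits: int) -> int:
    """Inner product of `samples` (two's complement, `bits` wide) with the
    coefficients stored in `lut`, using one table access per data bit."""
    acc = 0
    for j in range(bits):
        # Gather bit j of every sample into one table address.
        addr = 0
        for i, x in enumerate(samples):
            addr |= ((x >> j) & 1) << i
        partial = lut[addr] << j          # scale by the bit weight
        # The most significant bit carries negative weight in two's complement.
        acc += -partial if j == bits - 1 else partial
    return acc
```

For example, with `lut = build_da_lut([3, -1, 4, 2])`, the call `da_dot(lut, [5, -7, 100, -128], 8)` returns the same value as the ordinary sum of products, 3·5 + (-1)·(-7) + 4·100 + 2·(-128) = 166, without performing any multiplication.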

Page 12

Common DSP Functions

Filters
– FIR
– IIR

Transforms
– FFT
– DCT

Modulation
– Multipliers
– SIN tables

Basics
– Multiply / add
– Storage

Page 13

FIR Filter

[Diagram: FIR FILTER. Sample data, N bits wide, enters a delay line K taps long; each tap X0, X1, X2, ... is multiplied by its coefficient C0, C1, C2, ...; the K products are summed (SUM) to form the output data]
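The structure in the diagram, a delay line of K taps with one coefficient per tap and a final sum, can be sketched in a few lines. This is a plain software model for illustration; the function name is hypothetical and no fixed-point effects are modeled.

```python
from collections import deque
from typing import Iterable, List, Sequence


def fir_filter(coeffs: Sequence[float], samples: Iterable[float]) -> List[float]:
    """Direct-form FIR: y[n] = C0*X0 + C1*X1 + ... where X0 is the newest
    sample and each later tap holds the sample from one step earlier."""
    taps = deque([0.0] * len(coeffs), maxlen=len(coeffs))
    out = []
    for x in samples:
        taps.appendleft(x)  # shift the delay line; the oldest sample drops off
        out.append(sum(c * t for c, t in zip(coeffs, taps)))
    return out
```

Feeding a unit impulse, `fir_filter([1, 2, 3], [1, 0, 0, 0])`, returns the coefficients followed by zero, which is the defining property of a finite impulse response.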

Page 14

FIR Filter LogiCOREs

Two Basic Types:

1. Serial Distributed Arithmetic FIR
   – SDA FIR - Single Channel
   – SDA FIR - Dual Channel

2. Parallel Distributed Arithmetic FIR

Combine basic PDA or SDA FIR cores to solve many problems

Page 15

SDA FIR Filters: Serial Distributed Arithmetic

• Parallel In, Parallel Out, Bit-Serial Internally
• All taps processed in parallel
• Full precision through entire core
• One clock cycle required for each data bit
• One additional clock cycle for symmetric filters

EXAMPLE: 10-bit data, 80 taps, symmetrical FIR:

• For a bit level clock = 90 MHz
• Max sample rate = 90 MHz / 11 clks = 8.2 Million samples/sec.
• Process 80 taps every 122 nsec.
• 656 Million MACs, 257 CLBs, 2.55 Million MACs / CLB
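The numbers in this example follow directly from the rules listed above it: one clock per data bit, plus one extra clock for a symmetric filter. A minimal sketch of that arithmetic, with a hypothetical function name and return layout:

```python
def sda_fir_metrics(bit_clock_hz: float, data_bits: int, taps: int,
                    clbs: int, symmetric: bool) -> dict:
    """Throughput figures for a serial distributed arithmetic FIR,
    assuming one clock per data bit and one more if symmetric."""
    clocks_per_sample = data_bits + (1 if symmetric else 0)
    sample_rate_hz = bit_clock_hz / clocks_per_sample
    macs_per_second = taps * sample_rate_hz
    return {
        "clocks_per_sample": clocks_per_sample,
        "sample_rate_hz": sample_rate_hz,
        "sample_period_ns": 1e9 / sample_rate_hz,
        "macs_per_second": macs_per_second,
        "macs_per_second_per_clb": macs_per_second / clbs,
    }
```

Calling `sda_fir_metrics(90e6, 10, 80, 257, True)` gives 11 clocks per sample, about 8.18 million samples per second, a period of about 122 ns and about 2.55 million MACs per second per CLB. The unrounded MAC rate is about 654.5 million per second; the slide's 656 million corresponds to 80 taps times the rounded 8.2 million samples per second.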

Page 16

SDA FIR Properties

For a Given # of Taps:

• Coefficient bit-width determines size
  – # CLBs = function of D.A. LUT width
• Data bit-width determines max sample rate
  – One serial clock per bit
• Output data width does not affect CLB count

Page 17

What to Ask

• Data sample rate
• Number of taps
• Data word width
• Coefficient width
• Coefficient symmetry
• Same input & output sample rate?

Number of CLBs

Page 18

Serial Distributed Arithmetic FIR Filters

Serial Distributed Arithmetic, Data Word = Coefficient Size:

# CLBs
Taps  Type   5 bit   8 bit  10 bit  12 bit  14 bit  16 bit  18 bit  20 bit
8     Symm              33      36      39      42      45      52      55
8     Non               46      54      59      64      69      77      85
16    Symm              61      69      71      76      81      96     102
16    Non               80      95     104     112     123     138     142
24    Symm              89     101     108     116     127     146     154
24    Non              101     114     127     140     153     174     187
32    Symm             107     118     126     137     148     175     182
32    Non
40    Symm
40    Non
48    Symm             158     173     187     202     217     246     261
64    Symm             197     215     233     250     268     305     323
80    Symm             236     257     278     299     320     364     385

Sample Rate (MHz), XC4000E-1
      Type   5 bit   8 bit  10 bit  12 bit  14 bit  16 bit  18 bit  20 bit
      Symm    13.3     8.9     7.3     6.2     5.3     4.7     4.2     3.8
      Non     16.0    10.0     8.0     6.7     5.7     5.0     4.4     4.0

[Cell values that cannot be assigned to a row from this transcript: 53; 80; 93; 116 138 154 165 179 191 226 239]
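The sample-rate rows of this table are consistent with the SDA rule of one clock per data bit, plus one for symmetric filters, if the bit clock is 80 MHz. That clock value is an inference from the table, not something the slide states, and the names below are hypothetical. A minimal sketch:

```python
BIT_CLOCK_MHZ = 80.0  # assumption: inferred from the table, not stated on the slide
WIDTHS = (5, 8, 10, 12, 14, 16, 18, 20)


def max_sample_rate_mhz(data_bits: int, symmetric: bool,
                        bit_clock_mhz: float = BIT_CLOCK_MHZ) -> float:
    """One serial clock per data bit; symmetric filters need one more."""
    clocks = data_bits + (1 if symmetric else 0)
    return bit_clock_mhz / clocks


def sample_rate_row(symmetric: bool) -> list:
    """One table row, rounded to one decimal place like the slide."""
    return [round(max_sample_rate_mhz(b, symmetric), 1) for b in WIDTHS]
```

`sample_rate_row(True)` and `sample_rate_row(False)` reproduce the Symm and Non rows of the table for all eight data widths.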

Page 19

Distributed RAM is More Efficient

For SDA FIR Filters: build the Time-Skew Buffer with Distributed RAM, not Flip Flops

[Diagram: a 16 x 1 Shift Register built from flip-flops (FF) takes 16 Logic Cells; the same 16 x 1 Shift Register built from one 16x1 RAM Cell Primitive and one FF takes 1 Logic Cell]
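One way to see why a 16x1 RAM can stand in for sixteen flip-flops is to model it as a circular buffer. The sketch below is a behavioral illustration only: the class name is hypothetical, and it assumes an address counter that steps once per clock (in hardware such a counter could be shared by many bit lanes).

```python
class RamShiftRegister:
    """A 1-bit-wide delay line built on a small RAM plus an address counter."""

    def __init__(self, depth: int = 16) -> None:
        self.ram = [0] * depth   # models one 16x1 RAM primitive
        self.addr = 0            # models the address counter

    def clock(self, bit_in: int) -> int:
        """One clock: read the oldest bit, overwrite it with the newest."""
        bit_out = self.ram[self.addr]
        self.ram[self.addr] = bit_in
        self.addr = (self.addr + 1) % len(self.ram)
        return bit_out
```

Every bit written with `clock()` reappears at the output exactly 16 calls later, which is the behavior of a 16-stage shift register.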

Page 20

SDA FIR Filters

Xilinx Distributed RAM - Uses One Third the Area

[Chart: Device Size (LCs), vertical axis from 0 to 1600. Filters: 16-Taps 16-Bits, 16-Taps 8-Bits, 64-Taps 9-Bits, 64-Taps 16-Bits. Series: Xilinx Distributed RAM, Block RAM]

Best Device Utilization: Distributed RAM well suited to DSP

Page 21

PDA FIR Filter Core: Parallel Distributed Arithmetic FIR Filters

• Fully parallel implementation
• All taps processed in parallel (same as SDA)
• All bits processed in parallel
• Up to 100 million samples per second
• 2 billion MACs per 20-tap core

[Symbol: PDA FIR. Inputs: Data_IN, Clock (CK), Cascade Mid_In (C_M_IN). Outputs: DATA_OUT, Cascade Data_Out (C_D_OUT), Cascade Mid_Out (C_M_OUT)]
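The throughput claim for the parallel core is simple arithmetic, and it also shows how the parallel and serial forms relate at the same clock. A minimal sketch with hypothetical function names, assuming the parallel core accepts one sample per clock and the serial core needs one clock per data bit:

```python
def pda_macs_per_second(sample_rate_hz: float, taps: int) -> float:
    """All taps and all bits in parallel: every sample costs `taps` MACs."""
    return sample_rate_hz * taps


def pda_speedup_over_sda(data_bits: int, symmetric: bool = False) -> int:
    """At equal clock rate the serial core needs one clock per data bit
    (one more if symmetric), so the parallel core is that many times faster."""
    return data_bits + (1 if symmetric else 0)
```

`pda_macs_per_second(100e6, 20)` gives the slide's 2 billion MACs per second for a 20-tap core. For 10-bit data, `pda_speedup_over_sda(10)` is 10, the price being that all bits are processed in parallel hardware.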

Page 22

PDA FIR Filters

The high data sample rate solution

• Parameterized
• Input data: 4 to 24 bits
• Coefficients: 4 to 24 bits
• Symmetric, non-symmetric, negative symmetry
• Output data: 2 to 31 bits
• Taps: 2 to 20 per core
• Automatically trims unused coefficient ROMs
• Supports cascading multiple filter cores

Page 23

CORE Generator Software

LogiCORE:

AllianceCORE: Data Sheets

Web Mechanism to download new cores

SystemLINX: Ability to call CORE Generator from Third Party Tools

Page 24

One line Documentation

Page 25

CORE Generator Methodology

1. Select a CORE

2. Enter parameters

3. Generate Core

Page 26

160 CLB

HOW?

LogiCORE - SDA Filter

Filter Design Package

Page 27

DSP CORE Generator Outputs

• Schematic symbol
• VHDL or Verilog HDL instantiation code
• Simulation model
• Design netlist with constraints

[Diagram: FIR Filter Recipe and Parameters enter the DSP CORE Generator; the result is a 32 Tap FIR Filter, 20 rows by 9 columns, 160 CLBs used]

Predictable performance regardless of the number of cores

Page 28

Predictable Size & Performance

• Built for System Performance - Not Benchmarks.
• Generated with RPM (Relationally Placed Macro).

RPM Macro Level Advantages, RPM System Level Advantages

• Predictable size.
• Close proximity of communicating elements
• Alignment of Critical paths
• Accessible I/O signals
• Improves Density
• Rapid progress for automatic and manual design methods (1 macro, NOT 100’s of elements!)
• Consistent performance anywhere on the die.
• Packing density very high
• Adequate set-up times

Filling a device with Xilinx Cores does not reduce performance

Page 29

Performance Independent of core location

[Diagram: same core installed in different locations, 80 MHz in each]

Xilinx LogiCOREs deliver the same performance for any placement

Non-segmented routing FPGAs can’t do this

Page 30

Performance Independent of Device Utilization

[Diagram: four cores in one device, 80 MHz each]

Xilinx has performance independent of the number of cores added

Non-segmented routing FPGAs can’t do this

Page 31

Best FPGA Performance: Xilinx is more Predictable

[Chart: 12x12 Area Efficient Multiplier. Horizontal axis: Number of Instances (1, 2, 3, 4, 8). Vertical axis: Speed (MHz), 40 to 80. Series: Xilinx Segmented, Non-Segmented]

Segmented = More Predictable and Repeatable

Page 32

Performance Independent of Device Size

[Diagram: three device sizes, 80 MHz each]

Same performance for a 4005 or 4085

Non-segmented routing FPGAs can’t do this

Page 33

Design Flow

[Block diagram labels: 20 MHz input; Mixer (Complex Demod, 4 multipliers, COS and SIN); I and Q paths; Low Pass 32-TAP FIR, Decimate 4:1; 5 MHz; 48-TAP FIR; 4K x 16 RAM; Base-band processor]

• Generate each module.
• Use Schematic or HDL at a system level.
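The front end of this design flow, a mixer that produces I and Q followed by a decimating low-pass FIR, can be modeled in software to check the signal path before generating cores. The sketch below is illustrative only and makes several assumptions that are not on the slide: a real-valued input (so two multiplies per sample rather than the four the slide lists), a 5 MHz carrier in the usage note, and moving-average coefficients as a stand-in for a designed low-pass filter. All names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def complex_demod(samples: Sequence[float], carrier_hz: float,
                  sample_rate_hz: float) -> Tuple[List[float], List[float]]:
    """Multiply the input by COS and SIN of the carrier phase to get I and Q."""
    i_path, q_path = [], []
    for n, x in enumerate(samples):
        phase = 2.0 * math.pi * carrier_hz * n / sample_rate_hz
        i_path.append(x * math.cos(phase))
        q_path.append(x * math.sin(phase))
    return i_path, q_path


def decimating_fir(samples: Sequence[float], coeffs: Sequence[float],
                   factor: int) -> List[float]:
    """FIR low-pass that only computes every `factor`-th output sample."""
    out = []
    for n in range(0, len(samples), factor):
        acc = 0.0
        for k, c in enumerate(coeffs):
            if n - k >= 0:            # samples before the start count as zero
                acc += c * samples[n - k]
        out.append(acc)
    return out
```

As a usage example, a 128-sample cosine at the assumed 5 MHz carrier, sampled at 20 MHz and passed through `complex_demod` and then `decimating_fir(..., [1 / 32] * 32, 4)`, yields 32 output samples per path (the 5 MHz rate on the slide), with I settling near 0.5 and Q near 0.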

Page 34

Implementing the Mixer

This mixer supports sample rates in excess of 85 MHz. It even supports sample rates up to 45.6 MHz using the slowest Xilinx device (E-4).

Page 35

Joining the Cores

Here VHDL is used to link the cores into a system. Schematic symbols may also be used.

    skip_value: skip_val  --The integrator for skipping through the Sine table with forcing constant
        port map (cb => skip_constant);

    skip_integrater: skip_int
        port map (b  => skip_constant,
                  s  => skip_integrate,
                  l  => GND,
                  ce => VCC,
                  c  => clk);

    form_sine_address: for i in 0 to 6 generate
        --extract 7 bits required to address look-up table
        --MSB is not used as this represents overflow.
        --Lower bits are internal precision for integrator.
        skip_address(i) <= skip_integrate(i+10);
    end generate form_sine_address;

    sine_table: sine_lut  -- sine wave look-up table
        port map (theta  => skip_address,
                  output => sine_wave,
                  ctrl   => VCC,  --select SINE output when high
                  c      => clk);

All component declaration and port map code provided by Coregen
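The VHDL above wires a phase accumulator to a sine look-up table: a constant is added every clock, and bits 10 to 16 of the running sum address the table. A behavioral model can help when choosing the constant. The sketch below is an illustration under assumptions not confirmed by the slide: the table is taken to hold one full sine cycle in its 128 entries, outputs are floating point rather than fixed point, and the Python names are hypothetical.

```python
import math
from typing import List

FRACTION_BITS = 10   # accumulator bits 0..9: internal precision only
ADDRESS_BITS = 7     # accumulator bits 10..16: look-up table address
PHASE_MODULUS = 1 << (FRACTION_BITS + ADDRESS_BITS)  # higher bits are overflow

# Assumption: one full sine cycle stored in the 128-entry table.
SINE_TABLE = [math.sin(2.0 * math.pi * a / (1 << ADDRESS_BITS))
              for a in range(1 << ADDRESS_BITS)]


def nco(skip_constant: int, count: int) -> List[float]:
    """Generate `count` sine samples by integrating `skip_constant`."""
    acc = 0
    out = []
    for _ in range(count):
        address = (acc >> FRACTION_BITS) & ((1 << ADDRESS_BITS) - 1)
        out.append(SINE_TABLE[address])
        acc = (acc + skip_constant) % PHASE_MODULUS  # drop the overflow
    return out


def output_frequency_hz(skip_constant: int, clock_hz: float) -> float:
    """Frequency of the generated sine for a given clock and constant."""
    return clock_hz * skip_constant / PHASE_MODULUS
```

With `skip_constant = 1 << 10` the address advances by one table entry per clock, so one sine cycle takes 128 clocks; smaller constants step through the table more slowly, using the low 10 bits as fractional phase.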

Page 36

Power Dissipation Advantage: Often the Limiting Factor In DSP

Xilinx advantage over competitive FPGAs:
– Segmented routing is essential in DSP applications
– Altera Runs 3X HOTTER than Xilinx!

Xilinx advantage over DSP processors:
– TI Runs 2X HOTTER (320c6)
– Independent study by Stanford

[Graphic: STOP sign, Too Much Heat]

Page 37

Segmented Interconnect Yields Lower Power

[Chart: Power (W), 0 to 10, versus Clock Frequency (MHz), 0 to 100. Curves: Non-Segmented, Xilinx Segmented. Package Thermal Limit marked for Ceramic and Plastic]

Segmented = Lower Power, Faster Operation

Page 38

Where to find opportunities

[Graphic: FIR Filter CORE, 100 Million Samples / sec.]

• Look for high performance applications
  – Multiple DSP processors
  – Fixed function DSP parts
  – Gate array / custom DSP
• Data rates typically above 1 MHz
• Multiple channels required

Page 39

DSP Applications

Image & Video Processing
– Medical Imaging
– Copiers
– Cameras
– Security Systems
– Video editors
– Inspection Sys
– Fingerprint ID

Communications
– Wireless Comm
– Cellular / PCS
– Modems
– Satellite
– Cable
– ADSL
– Telephone Test

Industrial, Military
– Motor control
– Numerical control
– Test equipment
– Vibration analysis
– Power supplies
– Radar
– Secure comm.

Page 40

Where FPGA Solutions Fit

FPGAs ideal for high sample rates and computational intensity

[Chart: applications range from Audio to RF, Video, Multiple Channels. Regions: Processors, floating-point arithmetic; Processors, fixed-point arithmetic (kHz sample rates, single channel); FPGAs, fixed-point arithmetic (MHz sample rates)]