Blackfin ADSP-21535 Versus Sharc ADSP-21061


By: David W. Rasmussen

April 15, 2002

To be covered today:

Quick overview of the architectures of both the Blackfin and Sharc DSPs

Main features of both processors

Main differences between the processors

Code sample for an FIR on the Blackfin

Benchmark comparison of three major DSP algorithms

Sharc ADSP-21061 [1]

[Figure: ADSP-21061 core block diagram. Labelled blocks: DAG 1 (8 x 4 x 32), DAG 2 (8 x 4 x 24), instruction cache (32 x 48), program sequencer, PM address bus (24), DM address bus (32), PM data bus (48), DM data bus (40), bus connect, register file (16 x 40), floating- and fixed-point multiplier with fixed-point accumulator, 32-bit barrel shifter, floating- and fixed-point ALU, timer, flags, JTAG test and emulation.]

Sharc’s Main Features [2]:

32/40-bit IEEE floating-point math

32-bit fixed-point MACs with 64-bit product and 80-bit accumulation

No arithmetic pipeline, so all computations are single-cycle

Circular buffer addressing supported in hardware

32 address pointers support 32 circular buffers

16 40-bit data registers
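The hardware circular buffer addressing listed among these features can be pictured in software. The Python sketch below is only a model, assuming the usual base/length/modify description of a circular buffer; the function name `circular_update` is made up for illustration and is not an Analog Devices API.

```python
def circular_update(index, modify, base, length):
    """Post-modify an address pointer and wrap it inside [base, base + length).

    Software model of the wrap that a DAG performs in hardware;
    assumes |modify| <= length (an assumption of this sketch).
    """
    index += modify
    if index >= base + length:
        index -= length      # ran off the end: wrap to the start
    elif index < base:
        index += length      # ran off the start: wrap to the end
    return index


# Walk a 5-word delay line that starts at address 100, stepping by 2.
addr = 100
visited = []
for _ in range(6):
    visited.append(addr)
    addr = circular_update(addr, modify=2, base=100, length=5)
```

In software the wrap costs a compare and a branch on every access, which is the overhead that hardware support removes from an inner filter loop.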

Sharc’s Main Features Cont.:

Six nested levels of zero-overhead looping in hardware

Four busses to memory (2 DM + 2 PM)

1 Mbit on-chip dual-ported SRAM

Maximum processing of 50 MIPS

Up to four parallel operations can be processed in one clock cycle (add/subtract, multiply, DM access, PM access), assuming the pipeline is full

PM bus clashes between an instruction fetch and a data access are avoided by using the instruction cache

Blackfin ADSP-21535 [3]

Blackfin’s Main Features [4]:

Two 16-bit MACs, two 40-bit ALUs, and four 8-bit Video ALUs

Support for 8/16/32-bit integer and 16/32-bit fractional data types

Concurrent fetch of one instruction and two unique data elements

Two loop counters that allow for nested zero-overhead looping

Two DAG units with circular and bit-reversed addressing

300 MHz core clock performing 600 MMACs
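To make the 16-bit fractional data type and the 600 MMACs figure concrete, here is a small Python sketch. It assumes the common 1.15 convention (one sign bit, 15 fractional bits); the helper names are hypothetical, and the accumulator's guard bits and saturation are deliberately not modelled.

```python
Q15_ONE = 1 << 15  # scale factor of the 1.15 format: 15 fractional bits


def to_q15(x):
    """Convert a float in [-1, 1) to a saturated 16-bit 1.15 integer."""
    return max(-Q15_ONE, min(Q15_ONE - 1, int(round(x * Q15_ONE))))


def q15_mac(acc, a, b):
    """Accumulate a 1.15 x 1.15 product into a wide accumulator.

    The raw product carries 30 fractional bits. Python integers do not
    overflow, so accumulator width is not modelled here.
    """
    return acc + a * b


def acc_to_float(acc):
    """Interpret an accumulator holding 30 fractional bits as a float."""
    return acc / float(1 << 30)


# Dot product of two short fractional vectors.
x = [to_q15(v) for v in (0.5, -0.25, 0.125)]
h = [to_q15(v) for v in (0.5, 0.5, 0.5)]
acc = 0
for xi, hi in zip(x, h):
    acc = q15_mac(acc, xi, hi)
result = acc_to_float(acc)

# Peak MAC rate implied by the slide: two MACs per cycle at 300 MHz.
peak_mmacs = 2 * 300
```

The last line is simply the arithmetic behind the quoted peak rate: two multiply-accumulates per cycle at a 300 MHz core clock.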

Blackfin’s Main Features Cont.:

The following operations can be processed in parallel in one clock cycle:

Execution of a single instruction operating on both MACs or ALUs

Execution of two 32-bit data moves (either 2 reads or 1 read/1 write)

Execution of two pointer updates

Execution of a hardware loop update

Main Differences:

The Blackfin is a 16-bit integer processor, but it can also operate on 32-bit data values. When 32-bit data values are used:

Either one or two ALU operations can be performed in one clock cycle

One MAC can be performed, but it takes more than one clock cycle

The Sharc is a 32-bit floating-point processor
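One way to see why a 32-bit MAC costs more than one cycle on 16-bit multipliers is that a 32 x 32 product decomposes into four 16 x 16 partial products. The Python sketch below is a generic illustration of that decomposition for unsigned values, with a hypothetical function name; it is not the instruction sequence the Blackfin actually uses.

```python
MASK16 = 0xFFFF


def mul32_from_16(a, b):
    """Multiply two unsigned 32-bit integers using only 16 x 16 multiplies."""
    a_hi, a_lo = a >> 16, a & MASK16
    b_hi, b_lo = b >> 16, b & MASK16

    # Four partial products, each one a job for a 16-bit multiplier.
    lo_lo = a_lo * b_lo
    lo_hi = a_lo * b_hi
    hi_lo = a_hi * b_lo
    hi_hi = a_hi * b_hi

    # Recombine with the appropriate weights (2^0, 2^16, 2^16, 2^32).
    return lo_lo + ((lo_hi + hi_lo) << 16) + (hi_hi << 32)


product = mul32_from_16(0x12345678, 0x9ABCDEF0)
```

Under this decomposition, two 16-bit multipliers working in parallel would need at least two passes to cover the four partial products.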

Main Differences Cont.:

The Blackfin has 4 address registers (with corresponding base, length, and modify) to use for circular buffers versus the Sharc’s 32

The Blackfin has 2 nested hardware loops where the Sharc has 6

The Blackfin has an 8 stage pipeline (fetch 1-2, decode, execute 1-3, writeback) where the Sharc has a 3 stage

The Blackfin is clocked six times faster (300 MHz versus 50 MHz)

Blackfin FIR Code Sample [5]:

    LSETUP(E_FIR_START,E_FIR_END) LC0=P1>>1;    //Loop 1 to Ni/2
E_FIR_START:
    R1=PACK(R1.H,R0.H) || [I0++]=R0 || R2.L=W[I2++];
        //Store X1 into the lower half of R1.
        //Update the delay line.
        //Fetch h0 into lower half of R2
    LSETUP(E_MAC_ST,E_MAC_END) LC1=P2>>1;       //Loop 1 to Nc/2 - 1
    A1=R2.L*R1.L, A0=R2.H*R1.H || R2.H=W[I2++] || [I3++]=R3;
        //A1=h0*X1, A0=hn-1*X-n+1.
        //Fetch h1 into upper half of R2.
        //Store the output.
E_MAC_ST:
    A1+=R0.L*R2.H, A0+=R0.L*R2.L || R2.L=W[I2++] || R0=[I1--];
        //A1+=X0*h1, A0+=X0*h0
        //Fetch filter coeff. h2 into the lower
        //half of R2. Fetch X-1 and X-2 into the
        //upper and lower half of R0 (for the
        //first time in this loop)
E_MAC_END:
    A1+=R0.H*R2.L, A0+=R0.H*R2.H || R2.H=W[I2++];
        //A1+=X-1*h2, A0+=X-1*h1
        //Fetch h3 into the upper half of R2.
        //(for the first time in this loop)
E_FIR_END:
    R3.H=(A1+=R0.L*R2.H), R3.L=(A0+=R0.L*R2.L) || R0=[P0++] || R1=[I0];
        //A1+=X-n+2*hn-1, A0+=X-n+2*hn+2
        //Fetch the next pair of inputs (X2 and X3) into lower
        //and upper half of R0. Fetch X-n+2 and X-n+3 into R1

...
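For readers who do not follow Blackfin assembly, the following Python reference model shows the computation the sample is organised around: a direct-form FIR that produces two outputs per pass of the outer loop, one per accumulator. It is a behavioural sketch only. The function name is hypothetical, the delay line is replaced by plain list indexing, and no attempt is made to match the register allocation or fractional arithmetic of the original.

```python
def fir_two_per_pass(x, h):
    """Direct-form FIR: y[n] = sum_k h[k] * x[n - k], with x[n] = 0 for n < 0.

    Two accumulators (a0, a1) produce two consecutive outputs per pass of
    the outer loop, loosely mirroring how A0 and A1 are used in the
    assembly. Assumes len(x) is even.
    """
    y = []
    for n in range(0, len(x), 2):
        a0 = 0  # accumulates y[n]
        a1 = 0  # accumulates y[n + 1]
        for k, coeff in enumerate(h):
            if n - k >= 0:
                a0 += coeff * x[n - k]
            if n + 1 - k >= 0:
                a1 += coeff * x[n + 1 - k]
        y.extend([a0, a1])
    return y


# A unit impulse at the input should reproduce the coefficients at the output.
h = [1, 2, 3, 4]
x = [1, 0, 0, 0, 0, 0]
y = fir_two_per_pass(x, h)
```

Halving the outer loop count in this way corresponds to the `LC0=P1>>1` setup in the assembly, where the loop runs Ni/2 times.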

Benchmarks:

Algorithm Type             Time         Cycles
1024-pt complex FFT        0.37 ms      18,221
FIR Filter (per Tap)       20 ns        1
IIR Filter (per Biquad)    80 ns        4

For the Sharc [6]

Algorithm Type             Time         Cycles
256-pt complex FFT         0.0106 ms    3,176
FIR Filter (per Tap)       13.33 ns     4
IIR Filter (per Biquad)    20 ns        6

For the Blackfin [7]
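The time and cycle columns are related by the core clock: time = cycles / clock frequency. The short Python check below assumes the 50 MHz (Sharc) and 300 MHz (Blackfin) clocks quoted earlier in the slides; the function name is made up for illustration.

```python
def execution_time_ns(cycles, clock_mhz):
    """Execution time in nanoseconds for a cycle count at a given core clock."""
    return cycles * 1000.0 / clock_mhz


SHARC_MHZ = 50      # ADSP-21061 clock used in the slides
BLACKFIN_MHZ = 300  # ADSP-21535 clock used in the slides

sharc_fir_ns = execution_time_ns(1, SHARC_MHZ)
sharc_iir_ns = execution_time_ns(4, SHARC_MHZ)
sharc_fft_ms = execution_time_ns(18221, SHARC_MHZ) / 1e6

blackfin_fir_ns = execution_time_ns(4, BLACKFIN_MHZ)
blackfin_iir_ns = execution_time_ns(6, BLACKFIN_MHZ)
blackfin_fft_ms = execution_time_ns(3176, BLACKFIN_MHZ) / 1e6
```

This reproduces the tabulated times to within rounding; the Sharc FFT works out to roughly 0.364 ms against the quoted 0.37 ms.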

Analysis:

The Blackfin is faster for all three algorithms

The exact performance gain on the FFT is uncertain because the two benchmarks use different lengths (1024 points on the Sharc, 256 points on the Blackfin), but it is somewhere between 2 and 9 times faster

Both the FIR and IIR took more cycles to complete on the Blackfin, as more cycles are required for 32-bit operations
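The 2 to 9 times range for the FFT can be reproduced with a rough extrapolation of the Blackfin's 256-point time to 1024 points. The choice of growth models below is an assumption of this sketch, not something stated in the slides; N log2 N is the usual operation count for an FFT, while N and N^2 serve as optimistic and pessimistic bounds.

```python
import math

SHARC_1024_MS = 0.37       # from the Sharc table
BLACKFIN_256_MS = 0.0106   # from the Blackfin table


def scaled_time(t_small, n_small, n_large, growth):
    """Extrapolate a run time from n_small points to n_large points."""
    return t_small * growth(n_large) / growth(n_small)


# Assumed growth laws for the run time as a function of FFT length.
growth_models = {
    "N": lambda n: n,
    "N log2 N": lambda n: n * math.log2(n),
    "N^2": lambda n: n * n,
}

speedups = {}
for name, growth in growth_models.items():
    blackfin_1024_ms = scaled_time(BLACKFIN_256_MS, 256, 1024, growth)
    speedups[name] = SHARC_1024_MS / blackfin_1024_ms
```

Under these assumptions the estimated speedup lies between about 2.2 (N^2 scaling) and 8.7 (linear scaling), with roughly 7 for N log2 N scaling.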

References

1. ENCM515 Lecture Slides for January 11, 2002, [http://www.enel.ucalgary.ca/People/Smith/2002webs/encm515_02/02presentations/02january/02overviewSHARCarchitecture.ppt], Dr. Mike Smith

2. Sharc Architecture Overview, [http://www.analog.com/technology/dsp/Sharc/architecture.html], Analog Devices

3. DSP Manuals, [http://www.analog.com/library/dspManuals/pdf/21535/overview.pdf], Analog Devices

4. Blackfin Architecture Overview, [http://www.analog.com/technology/dsp/Blackfin/architecture/basics.html], Analog Devices

5. FIR Blackfin Code Example, [ftp://ftp.analog.com/pub/dsp/blackfin/examples/fir_032101.zip], Analog Devices

6. Sharc DSP Data Sheet, [http://www.analog.com/productSelection/pdf/ADSP-20161_L_b.pdf], Analog Devices

7. Blackfin DSP Benchmark Comparison, [http://www.analog.com/technology/dsp/Blackfin/benchmarks/examples.html], Analog Devices

Special Thanks To:

Mike Roest for the use of his individual assignment, entitled “Examination of the Analog Devices Blackfin and SHARC 21061” (submitted March 12, 2002), as preliminary research material for this report.
