Promising low power reusable solutions: Application Specific Instruction-set Processors

Myung Hoon Sunwoo
Multimedia Comm. SoC Lab., Ajou University, Korea

Ajou Univ. SOC Lab., Multimedia Communications

Outline

What is ASIP? and Why ASIP?

SPOCS (Signal Processors for OFDM Communications)
  SPOCS Architecture for FFT and Bit Manipulation
  Performance Comparisons and Implementations

DASIP (Digital Audio Specific Instruction set Processor)
  Proposed Instructions and Coprocessor
  Proposed Inverse Quantization Algorithm

VSIP (Video Specific Instruction set Processor)
  Proposed Instructions and Coprocessors
  Performance Comparisons

Trends of recent ASIPs
  Applications of Low power ASIPs
  ASIP design technologies

Conclusions




What is ASIP?

DSP
  Advantages: Programmability, Flexibility
  Disadvantages: Low Performance / High Power Consumption

ASIC
  Advantages: Optimization, Low Power, High Performance
  Disadvantages: High Development Cost, Low Flexibility, Long Time to Market

ASIP = Advantages of ASIC + Advantages of DSP

Target: Multi-Standard Multimedia & Communications (WLAN, 4G Wireless Communication, DVB, DAB, H.264/AVC, DMB)

What is ASIP?

Changes of System Design Environment
  Short Time to Market
  Frequent Spec. Changes
  27% CAGR (Compound Annual Growth Rate) of DSP Market

[Figure: DSP market size in $B per year, 2002 to 2009. Source: Forward Concepts, February 2005]

Why ASIP?

Computational Efficiency and Flexibility

[Figure: flexibility versus performance. From most flexible to highest performance: General Purpose Processors (StrongARM110, 0.4 MIPS/mW), Digital Signal Processors (TMS320C54x, 3 MIPS/mW), Application Specific Instruction set Processors, Application Specific ICs, Physically Optimized ICs. Source: T. Noll, RWTH Aachen]

Determine the Best Choice between Flexibility vs. Performance
High Performance and Flexibility System: Application Specific Instruction set Processors

What Resources in SOC

Hardware resources
  Digital signal processors, Microprocessors, ASIPs
  Various Memories
  Peripheral, Interface
  Programmable Cores
  A/D, D/A, Analog

Software resources
  RTOS
  Middle Ware
  Application SW

[Figure: SOC layers. Hardware-independent Software: Applications, User defined Interface, Libraries, Middleware. Hardware-Dependent Software: Operating Systems (Kernel), Device Drivers. Hardware: CPU Core, DSP, MPEG, Cache, ROM, DRAM, Analog, Logic, etc.]

SOC Challenges

Reuse Technology

Design methodologies
  Timing Driven Design Methodology
  Block Based Design Methodology
  Platform Based Design Methodology

[Figure: example chips for each methodology, assembled from blocks such as a reusable μP core, SRAM, ROM, RAM, Data Cache, ATM, MPEG, Serial I/F, Soft I/F IP, Logic, and Customer Defined Logic]

Cited from "Surviving the SOC Revolution," Chang et al., Kluwer Academic Publishers

Microprocessors vs. Digital Signal Processors

History of Microprocessors
  Converging

History of DSPs
  Diverging
  Hundreds of DSPs (In-house)

Design flow of ASIP

Steps
  1. Target Application Selection
  2. Application Profiling
  3. H/W, S/W Partitioning
  4. Design Special Instructions and Architecture
  5. Design Hardware Accelerators
  6. Verification and Performance Comparison
  7. Chip Fabrication

                                   SPOCS                DASIP                      VSIP
Target application                 WLAN                 MPEG-2/4 AAC               H.264/AVC
Instructions and accelerators      FFT, Bit operation   IMDCT, Huffman decoding    ME/MC, VLC
Verification (all three)           FPGA board, LISA simulator, C/Matlab program

Design flow using LISATek tools

[Figure: LISATek Processor Designer flow. The architecture is written as a LISA 2.0 description. From it the tool generates the Software Tools (C-Compiler, Assembler, Linker, Simulator), which are used with the Application for Debugging & Profiling. Analyze the results: if the design goals are not met, adjust the architecture and generate again; if they are met, build the RTL Implementation (Verilog, VHDL, SystemC) through RTL Generation and the SystemC Models for ConvergenSC]

Software tool development

[Figure: LISATek Development Environment. Assembler / Linker view with disassembly and assembly code; Simulator view with Register, Memory, and Pipeline windows]

HW/SW verification environment

Compare FPGA board, C / Matlab, Lisa simulator
Reduce the ASIP development time

Ex) Verification of IMDCT of DASIP
  [Figure: outputs of the C simulator, the FPGA results, and the Lisa simulator shown side by side: Matching !!]


Signal Processors for OFDM Communication Systems (SPOCS)

FFT calculation problem of General DSP
  Do/Loop instruction => additional cycle needed
  Inefficient Butterfly calculation (Fixed MAC structure)

SPOCS
  FFT #N (Instruction): input data address decision
  FAGU (FFT AGU): address generation (automatically), reduces address generation time

[Figure: SPOCS block diagram. PCU (Program Control Unit), Program Memory, AGU (Address Generation Unit), FAGU (FFT AGU) with address offset, DPU (Data Processing Unit), BMU (Bit Manipulation Unit), Data Memory]

DSP            FFT calculation cycle
Carmel DSP     (N+10)log2N + 5N/4 - 4
TMS320C62X     (4N/2)log2N + 7log2N + N/4 + 9
SPOCS          (2N/2)log2N + 9
* N : FFT point

SPOCS: application specific signal processor for OFDM communication systems [Jour. of signal proc., 2008].
Design of new DSP instructions and their hardware architecture for high-speed FFT [Jour. of VLSI signal proc., 2003].
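The cycle-count formulas on this slide can be evaluated directly. The sketch below is only an illustration: the function names are made up for this example, and N is assumed to be a power of two so that log2N is an integer.

```python
def log2_int(n):
    """Integer log2 for a power-of-two FFT size (assumption: n is 2**k)."""
    if n < 2 or n & (n - 1):
        raise ValueError("FFT size must be a power of two")
    return n.bit_length() - 1

def carmel_cycles(n):
    # (N+10)log2N + 5N/4 - 4
    return (n + 10) * log2_int(n) + 5 * n // 4 - 4

def tms320c62x_cycles(n):
    # (4N/2)log2N + 7log2N + N/4 + 9
    return (4 * n // 2) * log2_int(n) + 7 * log2_int(n) + n // 4 + 9

def spocs_cycles(n):
    # (2N/2)log2N + 9
    return (2 * n // 2) * log2_int(n) + 9

if __name__ == "__main__":
    n = 64
    print(carmel_cycles(n), tms320c62x_cycles(n), spocs_cycles(n))  # 520 835 393
```

For N = 64 the three formulas give 520, 835, and 393 cycles, so the SPOCS count is roughly three quarters of the Carmel count and under half of the TMS320C62X count.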

SPOCS architecture

Proposed DPU Architecture (2MAC/1ALU)
  [Figure: two multipliers feeding P1 and P2, Switching Logic, Adder1 to Adder3, Acc1 to Acc3]

Butterfly Calculation flow
  Cycle 1 (SBUTTERFLY)
  Cycle 2 (ABUTTERFLY)

SPOCS FFT Calculation
  DPU Architecture
    Adds Switching Logic to the fixed MAC of existing DSPs: supports MUL-MUL-SUB(ADD) and ADD-SUB operations per cycle
  FFT Instruction
    Existing DSP: many instructions (DO, ADD, SUB, Load, Store, MAC, etc.)
    SPOCS: FFT, SBUTTERFLY, ABUTTERFLY
  Support Various Instructions
    51 instructions including new instructions

SPOCS bit manipulation operations

Motivation
  Various communication systems have been developed, such as xDSL, WLAN, DMB, IMT2000, etc.
  These systems have similar bit manipulation functions.

[Figure: baseband chain. Transmit: Baseband Data, Scrambling, Convolutional Encoding/Puncturing, Interleaving, Modulation, Channel. Receive: Channel, Sync/Demodulation, Deinterleaving, Viterbi Decoding, Descrambling, Baseband Data]

Basic bit manipulation operations

Scrambling
  N-th output decided by XOR operations of the input bit and the N-th shifted data according to the generator polynomial
  Generator Polynomial = X^7 + X^4 + 1
  Shift, XOR operations
  [Figure: shift register R0 to R6 with Input and Output]

Convolutional Encoding
  Outputs derived by XOR operations of bits in the shift register decided by the encoder structure
  Shift, XOR operations
  [Figure: shift register R0 to R5 with Input, Output A, and Output B]

Bit Stream Multiplexing
  Combining two bit streams in alternating order
  [Figure: streams A4 A3 A2 A1 A0 and B4 B3 B2 B1 B0 combined into B7 A7 B6 A6 B5 A5 B4 A4 B3 A3 B2 A2 B1 A1 B0 A0]
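As an illustration of the shift and XOR work behind scrambling, the sketch below runs a 7-bit shift register for the generator polynomial X^7 + X^4 + 1. The bit ordering inside `state` and the all-ones default seed are assumptions made for this example; they are not taken from the slide.

```python
def scramble(bits, state=0b1111111):
    """Scramble (or descramble) a list of 0/1 bits with X^7 + X^4 + 1.

    Assumed layout: bit 0 of `state` is the newest register stage (x1),
    bit 6 is the oldest stage (x7).
    """
    out = []
    for b in bits:
        feedback = ((state >> 6) ^ (state >> 3)) & 1  # x7 XOR x4
        out.append(b ^ feedback)                      # XOR with the data bit
        state = ((state << 1) | feedback) & 0x7F      # shift, insert feedback
    return out

if __name__ == "__main__":
    data = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
    tx = scramble(data)
    rx = scramble(tx)        # same operation with the same seed descrambles
    print(rx == data)        # True
```

Each output bit costs one shift and two XORs on a general-purpose ALU, which is the per-bit overhead a single-cycle bit manipulation instruction is meant to remove.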

Basic bit manipulation operations

Puncturing
  Deletes some of the encoded bits according to patterns
  Bit Insert and Extract Operations
  [Figure: Input A and Input B combined into Output]

Interleaving
  Shuffling input bits
  Bit Insert and Extract Operations
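Bit stream multiplexing and puncturing can be sketched in a few lines. The helper names and the keep/delete pattern below are hypothetical: the pattern simply keeps 4 of every 6 coded bits (rate 1/2 to rate 3/4) and is not claimed to be the pattern of any particular standard.

```python
def multiplex(stream_a, stream_b):
    """Combine two streams in alternating order: A0, B0, A1, B1, ..."""
    out = []
    for a, b in zip(stream_a, stream_b):
        out.extend([a, b])
    return out

def puncture(bits, pattern):
    """Delete every bit whose position in the repeating pattern is 0."""
    return [bit for i, bit in enumerate(bits) if pattern[i % len(pattern)]]

if __name__ == "__main__":
    a = ["A0", "A1", "A2"]
    b = ["B0", "B1", "B2"]
    coded = multiplex(a, b)
    print(coded)                                # ['A0', 'B0', 'A1', 'B1', 'A2', 'B2']
    print(puncture(coded, [1, 1, 1, 0, 0, 1]))  # ['A0', 'B0', 'A1', 'B2']
```

Both steps are pure bit extraction and insertion, which is why they map onto the same bit insert and extract operations.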


SPOCS bit manipulation Instructions

Puncturing, Interleaving
  Existing DSP: Input Data, Shift Left, Shift Right, OR Operation, Data Generation
  SPOCS: Input Data, Bit Extract (Programmable Switch), Bit Load Register (load the extracted bit), Data Generation
    1 Cycle Operation

Scrambling, Convolution
  Existing DSP: Input Data, Shifter (Shift), ALU (XOR Operation)
  SPOCS: Input Data, BMU (a maximum of 9 data can be shifted and XORed)
    1 Cycle Operation
FFT performance of SPOCS

Key Features
  Proposed Instructions for FFT Calculation: FFT, ABUTTERFLY, SBUTTERFLY
  FAGU: automatically generates data addresses (Very Fast FFT Operation)
  Reduce Program Memory Accesses (only three instructions) => Very Low Power

Meet Various Communication Standards

Standard         FFT point   Time limit (µs)   SPOCS time (µs)
WLAN (54Mbps)    64          4                 1.4
DAB              512         62                16.5
                 2048        256               80.5
DVB-T            2048        231               80.5
VDSL             4096        250               174.5

Implementation of application-specific DSP for OFDM systems [IEEE ISCAS 2004].
FFT operating apparatus of programmable processors and operation method thereof [US/European patents].
Digital signal processor architecture with bit manipulation accelerator for communication systems [EURASIP JASP, 2005].
Bit manipulation operation circuit and method in programmable processor [US patents].
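A quick way to sanity-check the time limits is to turn the SPOCS cycle formula, (2N/2)log2N + 9, into microseconds. The sketch assumes a 280 MHz clock, the SPOCS operating frequency in the ASIP specification table; the function names are made up for this example, and the computed times are estimates rather than the measured figures in the table.

```python
CLOCK_MHZ = 280.0  # assumed clock, from the ASIP specification table

def spocs_cycles(n):
    """(2N/2)log2N + 9 for a power-of-two FFT size n."""
    return n * (n.bit_length() - 1) + 9

def spocs_time_us(n, clock_mhz=CLOCK_MHZ):
    """Cycles divided by MHz gives microseconds."""
    return spocs_cycles(n) / clock_mhz

# (FFT point, time limit in microseconds) pairs from the table
LIMITS = [(64, 4), (512, 62), (2048, 256), (2048, 231), (4096, 250)]

if __name__ == "__main__":
    for n, limit in LIMITS:
        t = spocs_time_us(n)
        print(f"N={n:5d}  estimate={t:7.1f} us  limit={limit} us  ok={t < limit}")
```

Under this assumption every estimate stays below its time limit, with the tightest margin for the 64-point WLAN case.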

OFDM performance of SPOCS

Performance                          Carmel DSP     TMS320C62X     SPOCS
DSP Structure                        VLIW           VLIW           Application Specific DSP
Hardware Size                        VLIW (N.A.)    VLIW (N.A.)    107,000 Gates + 12 Kbyte Memory
DPU Structure                        2MAC/2ALU      2MUL/6ALU      2MAC/1ALU
Cycles/Butterfly                     2              4              2
Calculation Time (FFT), 64-point     520            835            393
Calculation Time (FFT), 256-point    2,452          4,225          2,057
Calculation Time (FFT), 1024-point   11,616         20,815         10,249
Calculation Time (FFT), 2048-point   25,194         45,654         22,537

                                              StarCore SC140   TMS320C62X            SPOCS
Operation                                     4 Shift          4 Logical Operation   BMU
Convolution (IS-95) (K=9, R=1/2, 192 bits)    463              N.A.                  152
Block Interleaving (802.11a) (16 * 6 bits)    414              N.A.                  91
Scrambling (802.11a) (12Mbit/s)               N.A.             39 x 10^6             20 x 10^6
Convolution (802.11a) (12Mbit/s)              N.A.             77 x 10^6             12 x 10^6

SPOCS implementation

SPOCS Core Design
  SEC 0.18um Synthesis (Synopsys)
  • Gate : 107,000
  • Program Memory : 4 Kbyte, Data Memory : 8 Kbyte
  • Frequency : 290MHz
  Special Instruction Set for FFT Operation and BMU Instructions
  Can meet OFDM Communication standards

FPGA Implementation
  iPROVE Xilinx xc2v6000
  Emulate IEEE 802.11a WLAN

SPOCS implementation

Macro Libraries for IEEE 802.11a

Scrambling (Descrambling)
  DO #end, @R3
  SCB GR7, #0x0c
  MOV2 @R1, ACC0 | @R4, ACC1
  PUNC ACC1, GR2L
  MOV2 @R2, ACC0 | @R5, ACC1
  PUNC ACC1, GR3L
  end:

Convolution Encoding
  DO #ENDDO, @R4
  MOVEC R5, GR7
  MOVE @R0+, GR2
  CONV GR0, GR2, GR3, GR4
  CONV GR1, GR2, GR3, GR5
  MOVE @R1+, ACC0

Interleaving (Deinterleaving)
  DO #loop1, @R5
  DO #label1, @R1
  MOVE @R0+, GR0
  label1: PUNC ACC1, GR0L, GR6
  ROL GR6, ACC0
  MOVEC GR4, R0

Mapping (Demapping), start of 64 QAM mapping
  MOVI #0x0000,R3 * Q-channel input
  MOVI #0x0050,R4 * I-channel input
  MOVI #0x0090,R1 * to loop
  MOVI #0x0030,GR7 * #48 looping
  MOVE GR7,@R1
  DO #endof1,@R1
  MOVE @R3,GR0 * to change value two's complement
  MOVI #0x0003,ACC0 * make ACC0 011 to get last 2bits of GR0
  AND GR0,ACC0 * get last 2bits of GR0
  MOVEC ACC0,GR1 * store the value of ACC0
  MOVI #0x0002,ACC0 * make ACC0 010 to compare with GR1

IFFT (FFT)
  MOVI PSW 0x4000 -- PSW setting scale down
  MOVI M0 0x000A -- Xmem base = 10
  MOVI R7 0x000A -- Ymem base = 10
  IFFT #256
  SBUTTERFLY
  ABUTTERFLY

SPOCS implementation

HW/SW Verification Environment using FPGA, Matlab, Lisa simulator




Digital Audio Specific Instruction set Processor (DASIP)

ASIP for Audio Applications
  High Speed IMDCT
  High Speed Parallel Execution of Huffman Decoding
  High Performance Application Specific Instruction Set for Audio Algorithm

Audio Applications
  DOLBY (AC3)
  DTS 96/24
  MPEG AAC
  MP3PRO
  OGG, WMA

Digital Audio Specific Instruction set Processor (DASIP)

Register files including 32 registers
Program control unit, data processing unit, address generation unit
Huffman accelerator for MPEG-2/4 AAC
2 ROM tables and 2 Data Memories

[Figure: DASIP block diagram. Program Control Unit, Program Memory, Data Processing Unit, Address Generation Unit, Register files, Huffman accelerator, two ROM tables, two Data Memories]

Design of a high-quality audio-specific DSP core [Best Paper Award in IEEE SIPS 2005].
Computing circuits and method for running an MPEG-2 AAC or MPEG-4 AAC audio decoding algorithm on programmable processors [US and Korea patents].

Complexity of the MPEG-2 AAC decoding

High computational loads
  Filterbank: IMDCT (Inverse Modified DCT)
  Huffman decoding: Compare & Program controls

[Figure: pie chart of the decoding load. Legend: Filterbank, Huffman Decoding, Inv-Quant & scale, Etc. Slice values: 48%, 33%, 16%, 4.1%]

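Fast IMDCT Algorithm

The fast algorithm efficiently reduces the computational loads of the overall system by a factor of about 10
Using N/4-point complex IFFT

[Figure: signal flow from $X(k)$ to $x(n)$. The input is packed into the complex sequence $X(\frac{N}{2} - 2k - 1) + j \cdot X(2k)$. The multiplications around the IFFT use the twiddle factors $e^{j\frac{2\pi}{N}(n + \frac{1}{8})}$ and $e^{j\frac{2\pi}{N}(k + \frac{1}{8})}$]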

Proposed instructions for IMDCT

Processing flow from X(k) to x(n) and the instructions used
  Pre-processing: LDPRE, ST2
  N/4 IFFT
  Post-processing: LD4, ST2
  Data de-interleaving: LD4, ST2

LDPRE instruction
  • 4 data transfers (load)
  • IAMU
  • Support parallel loads

LD4 instruction
  • 4 data transfers (load)
  • High data bandwidth
  • Support parallel loads

ST2 instruction
  • 2 data transfers (store)
  • High data bandwidth
  • Support parallel stores

Huffman decoder

Specific Instructions for Huffman decoding
  HFMD GR0, GR1, Acc0, GR[n]
    GR0 <- index (9 bit) of [Acc0]
    GR1 <- code length (5 bit) of [Acc0]

<Special Feature>
  ▪ Gate Count : 3800 gates
  ▪ Index value directly loaded to Register

[Figure: HFMD data path. Bitstream parser, Accumulator, General Reg. (Huffman book select), Huffman decoder, two General Reg. outputs: index and code length]

<Performance Comparisons of Huffman Decoding>
Processor      Computation Cycle
TMS320C62x     N. A. (Very large)
Korean DSP     5 cycles
ASIC           2.5 cycles
Ajou ASIP      2 cycles
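The idea of returning an index and a code length from one lookup can be sketched with an ordinary table-driven Huffman decoder. Everything below is hypothetical: the toy codebook, the table layout, and the packing of a 9-bit index above a 5-bit length are assumptions chosen to mirror the field widths on the slide, not the actual HFMD implementation.

```python
def build_table(codes, max_len):
    """Build a 2**max_len lookup table; each entry packs (index << 5) | length."""
    table = [0] * (1 << max_len)
    for index, code in codes.items():
        length = len(code)
        prefix = int(code, 2) << (max_len - length)
        for fill in range(1 << (max_len - length)):
            table[prefix | fill] = (index << 5) | length
    return table

def decode(bits, table, max_len):
    """Decode a string of '0'/'1' characters into a list of symbol indices."""
    padded = bits + "0" * max_len          # so the last window is full width
    out, pos = [], 0
    while pos < len(bits):
        entry = table[int(padded[pos:pos + max_len], 2)]
        out.append(entry >> 5)             # index field (up to 9 bits)
        pos += entry & 0x1F                # code length field (5 bits)
    return out

if __name__ == "__main__":
    codes = {0: "0", 1: "10", 2: "110", 3: "111"}   # toy prefix code
    table = build_table(codes, 3)
    print(decode("0101100111", table, 3))           # [0, 1, 2, 0, 3]
```

In software each symbol needs a table read, a shift, a mask, and a pointer update; producing the index and the code length together is what lets a dedicated instruction cut this to a couple of cycles.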

Proposed inverse quantization algorithm

$X^{4/3} = \left(\frac{X}{8} \times 8\right)^{4/3} = \left(\frac{X}{8}\right)^{4/3} \times 16$

Stage ①, from X = 1 to 256:
  $X^{4/3} = \mathrm{LUT}(X)$

Stage ②, from X = 257 to 2047:
  [Equation: interpolation between LUT([X/8]) and LUT([X/8] + 1) weighted by rem(X/8), with a correction term involving (401 - [X/8]), scaled by $2^4$]

Stage ③, from X = 2048 to 8191, if rem(X/64) ≤ 32:
  [Equation: interpolation between LUT([X/64]) and LUT([X/64] + 1) weighted by rem(X/64), with a correction term involving (218 - [X/64]), scaled by $2^8$]

Stage ④, from X = 2048 to 8191, if rem(X/64) > 32:
  [Equation: same form as stage ③, with the weight taken from (rem(X/64) - 64)]

rem( ) is the Remainder Function and [ ] is the Gauss Function.

Features
  1. Requires a 256-entry LUT
  2. Consists of 4 stages
  3. No computation is required at the first stage
  4. All multiplications and divisions can be achieved by only shift operations
  5. The positive and negative errors have almost the same distribution (this can reduce error accumulation)
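To make the table-plus-interpolation idea concrete, the sketch below implements plain linear interpolation on top of the identity X^(4/3) = (X/8)^(4/3) x 16. It is a simplified baseline only: it leaves out the correction terms and the rem(X/64) case split of the proposed algorithm, and the function name, floating-point table, and extra table entry at 256 are assumptions made for this example.

```python
# Table of i^(4/3) for i = 0..256 (one entry past 255 so LUT[q + 1] always exists)
LUT = [i ** (4.0 / 3.0) for i in range(257)]

def inv_quant(x):
    """Approximate x^(4/3) for 0 <= x <= 8191 by plain linear interpolation."""
    if x <= 256:
        return LUT[x]                       # direct table read, no computation
    # 8^(4/3) = 16 and 64^(4/3) = 256, so scaling back is a shift by 4 or 8
    shift, scale = (3, 16.0) if x <= 2047 else (6, 256.0)
    q = x >> shift                          # Gauss function [x / 2^shift]
    r = x & ((1 << shift) - 1)              # remainder function rem(x / 2^shift)
    slope = LUT[q + 1] - LUT[q]
    return (LUT[q] + slope * r / (1 << shift)) * scale

if __name__ == "__main__":
    for x in (8, 260, 2080, 8191):
        exact = x ** (4.0 / 3.0)
        print(x, round(inv_quant(x), 4), round(inv_quant(x) - exact, 4))
```

Because x^(4/3) is convex, this baseline always overestimates between table points, so its errors are all positive and accumulate; the worst cases sit at the low end of each range, on the order of 0.09 below 2048 and 1.4 above it.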

Proposed architecture

EXTB instruction
  The rem(X/N) and the gauss [X/N] functions in one cycle
  Can reduce computational loads

The syntax of the EXTB instruction
  Syntax:       EXTB ACC0, GR0, #N
  Description:  ACC <- rem(GR0 / 2^N) when N < 0
  Description:  ACC <- [GR0 / 2^N] when N > 0

Implementation Results (Instruction count)
Processor                                ARM    TI 54X    DASIP
Direct linear interpolation algorithm    29     27        21
Tsai algorithm                           61     57        47
Proposed algorithm                       49     46        38

T. H. Tsai and C. C. Yen, "A High Quality re-quantization/quantization method for MP3 and MPEG-4 AAC audio coding," in Proc. IEEE Int. Symp. on Circuits and Syst., 2002, pp. 851-854
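A software model of what EXTB computes might look like the sketch below. It assumes that a negative N selects the remainder with the shift amount |N| and that N = 0 is not used; both points are assumptions for illustration, since the slide gives only the two description lines.

```python
def extb(gr0, n):
    """Model of EXTB for a non-negative integer gr0.

    n > 0: Gauss function, floor(gr0 / 2**n)
    n < 0: remainder function, gr0 mod 2**|n|  (assumed meaning of negative N)
    """
    if n > 0:
        return gr0 >> n
    if n < 0:
        return gr0 & ((1 << -n) - 1)
    raise ValueError("N = 0 is not defined in this sketch")

if __name__ == "__main__":
    x = 2080                      # 2080 = 32 * 64 + 32
    print(extb(x, 6), extb(x, -6))
```

On a processor without such an instruction the same pair of values takes a shift and a mask (plus loading the mask constant), which fits the lower DASIP instruction counts in the table.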

Proposed inverse quantization algorithm

[Figure: error graph of the proposed IQ method, proposed method vs. direct method]

Error                     Direct Method   Korean 256   Taiwan 256   Taiwan 128   The proposed
                          (256)           (2001)       (2003)       (2003)       Algorithm 256
Max. error (257-2048)     0.08728         0.04365      0.02538      0.03669      0.048115
Max. error (2049-8191)    1.39655         0.69832      0.35389      0.58217      0.323076
Average error             0.41979         -0.20990     0.03161      0.16233      0.0079631

Novel non-linear inverse quantization algorithm and its architecture for digital audio codecs [IEEE ISCAS 2007].



Video Specific Instruction set Processor (VSIP)

ASIP for Video Applications
  Special Features for Huffman
  ME/MC Coprocessor: Parameterized, Highly Parallel Architecture
  Optimized DALU for Integer DCT, Loop Filter
  Application Specific Instruction Set for Video Algorithm

Video Applications
  JPEG 2000
  MPEG 2/4
  H.264/AVC
Video Specific Instruction set Processor (VSIP)

[Figure: pie chart of H.264 Decoding (%). Legend: MC, In-Loop filter, VLC, Color converter, Inv. Transform/Q, Intra Prediction, Decode MV, Other]

[Figure: pie chart of H.264 Encoding (%). Legend: Motion Estimation, Intra Prediction, In-loop filter, CAVLC/UVLC, Transform/Q]

[Figure: VSIP block diagram. DSP Core with PCU, DPU, AGU, Program memory, Data memory, and Specific Instructions; ME/MC Coprocessor; CAVLC/UVLC Coprocessor]

ASIP Instructions and their hardware architecture for H.264/AVC [Journal of Semiconductor Technology and Science, 2005.12]

H.264 computation characteristic

Deblocking filtering
  p'0 = (p2 + 2*p1 + 2*p0 + 2*q0 + q1 + 4) >> 3
  p'1 = (p2 + p1 + p0 + q0 + 2) >> 2
  p'2 = (2*p3 + 3*p2 + p1 + p0 + q0 + 4) >> 3

  p'0 = (2*p1 + p0 + q1 + 2) >> 2
  p'1 = p1
  p'2 = p2

Intra prediction
  – a is predicted by (A + 2B + C + I + 2J + K + 4) >> 3
  – b, e are predicted by (B + 2C + D + J + 2K + L + 4) >> 3
  – c, f, i are predicted by (C + 2D + E + K + 2L + M + 4) >> 3
  – d, g, j, m are predicted by (D + 2E + F + L + 2M + N + 4) >> 3
  – h, k, n are predicted by (E + 2F + G + M + 2N + O + 4) >> 3
  – l, o are predicted by (F + 2G + H + N + 2O + P + 4) >> 3
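The two sets of deblocking equations translate directly into code. In the sketch below the function name and the `strong` flag are assumptions: the slide lists both equation sets but not the condition that selects between them, so the caller chooses.

```python
def filter_p_side(p, q, strong):
    """Return (p'0, p'1, p'2) for one line of pixels across a block edge.

    p = (p0, p1, p2, p3) and q = (q0, q1, ...) are the samples on each side
    of the edge; `strong` selects the first or second equation set.
    """
    p0, p1, p2, p3 = p
    q0, q1 = q[0], q[1]
    if strong:
        new_p0 = (p2 + 2 * p1 + 2 * p0 + 2 * q0 + q1 + 4) >> 3
        new_p1 = (p2 + p1 + p0 + q0 + 2) >> 2
        new_p2 = (2 * p3 + 3 * p2 + p1 + p0 + q0 + 4) >> 3
    else:
        new_p0 = (2 * p1 + p0 + q1 + 2) >> 2
        new_p1, new_p2 = p1, p2
    return new_p0, new_p1, new_p2

if __name__ == "__main__":
    p = (10, 10, 10, 10)
    q = (50, 50, 50, 50)
    print(filter_p_side(p, q, strong=True))   # (25, 20, 15)
    print(filter_p_side(p, q, strong=False))  # (20, 10, 10)
```

Every output is a weighted sum of neighbouring 8-bit samples followed by a rounding shift, which is the multiply-free add-and-shift pattern that a packed add instruction can cover in one step.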

Proposed instruction

Packed Instruction

[Figure: four 8-bit fields packed in one register. Left: existing packed instruction. Right: packed instruction required for H.264]

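Integer transform

Integer transform matrix

$$
Y = (C X C^{T}) \otimes E =
\left(
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{bmatrix}
X
\begin{bmatrix} 1 & 2 & 1 & 1 \\ 1 & 1 & -1 & -2 \\ 1 & -1 & -1 & 2 \\ 1 & -2 & 1 & -1 \end{bmatrix}
\right)
\otimes
\begin{bmatrix} a^2 & ab/2 & a^2 & ab/2 \\ ab/2 & b^2/4 & ab/2 & b^2/4 \\ a^2 & ab/2 & a^2 & ab/2 \\ ab/2 & b^2/4 & ab/2 & b^2/4 \end{bmatrix}
$$

$$
a = \frac{1}{2}, \quad b \cong \sqrt{\frac{2}{5}}, \quad d \cong \frac{1}{2}
$$

Operation flow of 4x4 integer transform

[Figure: butterfly flow graphs from x(0), x(1), x(2), x(3) to X(0), X(2), X(1), X(3). 1D Forward Transform uses weights 2 and -2; 1D Inverse Transform uses weights 1/2 and -1/2]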

Proposed instructions

fTRAN, iTRAN
  Forward / Backward Transform
  4 x 1 1D transform in 1 cycle
  2 input operands, 1 output operand
  Two modes for 16x16 blocks and others

Operation
  R4 = fTRAN (R0, mode)
    - mode 1 : 16x16
    - mode 2 : Others

Assembly
  ADD R0(0), R0(3), tmp0
  ADD R0(1), R0(2), tmp1
  SUB R0(1), R0(2), tmp2
  SUB R0(0), R0(3), tmp3
  ADD tmp0, tmp1, R4(0)
  ADD tmp2, tmp3<<1, R4(1)
  SUB tmp0, tmp1, R4(2)
  SUB tmp3, tmp2<<1, R4(3)
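The 4 x 1 transform can be checked against the integer transform matrix C in a few lines. The sketch below is illustrative only: the function names are made up, the butterfly follows the rows of C = [[1,1,1,1],[2,1,-1,-2],[1,-1,-1,1],[1,-2,2,-1]], and the scaling matrix E is left out.

```python
C = [[1, 1, 1, 1],
     [2, 1, -1, -2],
     [1, -1, -1, 1],
     [1, -2, 2, -1]]

def ftran_1d(x):
    """4 x 1 forward transform using only adds, subtracts, and shifts."""
    t0 = x[0] + x[3]
    t1 = x[1] + x[2]
    t2 = x[1] - x[2]
    t3 = x[0] - x[3]
    return [t0 + t1, t2 + (t3 << 1), t0 - t1, t3 - (t2 << 1)]

def transpose(m):
    return [list(row) for row in zip(*m)]

def forward_4x4(block):
    """Compute C * X * C^T with row transforms, a transpose, and column transforms."""
    rows = [ftran_1d(r) for r in block]              # X * C^T
    cols = [ftran_1d(c) for c in transpose(rows)]    # C * (X * C^T), transposed
    return transpose(cols)

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

if __name__ == "__main__":
    x = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    print(forward_4x4(x) == matmul(matmul(C, x), transpose(C)))  # True
```

A full 4x4 block needs eight such 1D transforms and one transpose, so a one-cycle 4 x 1 instruction replaces eight add and subtract operations per call.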

Performance comparisons

Deblocking filtering performance
  Deblocking filtering: Edge Filtering (66 %), Others (34 %)
  Edge filtering: 15 instructions on the 64x, 9 instructions with the Proposed Instruction (Reduced 40 %)
  Improves 20~25 % of deblocking filtering performance

64x
  LDW AX0, p
  LDW AX1, q
  LDW r1 #h'4
  LDW r2 #h'1
  LDW r3 #h'1222
  DOTPU4 r2, p
  DOTPU4 r3, q
  ADD2 acc0,acc1
  ADD2 acc0, r1
  SHFL acc0 3
  PACK acc0
  STDW acc0

Proposed Instruction
  r0=M(a0)
  r1=M(a1)
  r3=#h'4
  r4=hadd(r0:0011.0001)
  r5=hadd(r1:0111.0011)
  Acc0=r4+r5
  acc0=(acc0+r3)>>3
  M(a3)=acc0

Integer transform performance
                  TMS320c55x   TMS320c55x   TMS320c64x   Proposed
                  SW           HW           SW           ASIP
Required MIPS     12.8         2.8          1.0          1.2

Novel Instructions and Their Hardware Architecture for Video Signal Processing [IEEE ISCAS 2005].
ASIP Approach for Implementation of H.264/AVC [Journal of Signal Processing Systems, Jan. 2008]

VSIP implementation

Compare FPGA board, C / Matlab, Lisa simulator

Forward Integer Transform
  loop #16 lp
  R0=M(AR0,2)       - - - copy pixels to register
  R1=M(AR0,2)
  R2=M(AR0,2)
  R3=M(AR0,2)
  loop #2 ftran     - - - loop
  RF1=trans(RF0)    - - - transpose 4 x 4 matrix
  R0=ftran(R4,1)    - - - 1D integer transform
  R1=ftran(R5,1)
  R2=ftran(R6,1)
  R3=ftran(R7,1)
  ftran:            - - - ftran loop end
  nop
  nop
  M(AR1,2)=R0       - - - store pixels to memory
  M(AR1,2)=R1
  M(AR1,2)=R2
  M(AR1,2)=R3
  lp:

[Figure: chip photos, < VSIP ME Chip > and < VSIP MC Chip >]

Further Research

ASIP for motion estimation
  Aims to support various Motion Estimation (ME) algorithms
  Tries to find a good balance between flexibility and performance
  Funded by Samsung Electronics

Research topics
  Development of ME ASIP
    Optimal processor model
    Program templates for ME algorithms
  Scalability of ME ASIP
    Reconfigurable architecture
    Interconnection between core and H/W accelerator



ASIP for Communications

MSC8156 Processor - Freescale Semiconductor

Feature
  Provides flexibility, integration, and cost efficiency for next generation wireless communication standards (3G-LTE, WiMAX, eHSPA, TDD-LTE, etc.)
  Supports requirements of the next generation base station
    High speed processing and decreased latency
    Supports high data rates with the up-to-date OFDMA (Orthogonal Frequency Division Multiple Access) standard

[Figure: MSC8156 block diagram. SC3850 DSP cores, each with a 32 KB L1 I-Cache, a 32 KB L1 D-Cache, and a 512 KB L2 Cache/M2 Memory; CLASS; MAPLE-B with Dual RISC Processors and Turbo/Viterbi, FFT/IFFT, DFT/IDFT, and CRC blocks]

ASIPs for Multimedia (Video)

SSD1933 Multimedia Processor - Solomon Systech

Features
  Dual core architecture with ARM926EJ-S and AV-DSP
  High quality multimedia for mobile multimedia devices, navigation systems, and mobile internet devices

[Figure: SSD1933 block diagram. CPU Subsystem: ARM926, D-Cache, I-Cache. Multimedia Subsystem: AV-DSP, L1-Cache, 3D-DMA. Multimedia Acceleration: 2D Graphic Engine, Pre and Post Processing, PRISM. Also: Standard I/O, Connectivity, Human Interface, Memory Interface, Multimedia Interface, System control, SRAM]

ASIPs for Multimedia (Audio)

ZSP800 processor - VeriSilicon

Features
  Supports the Z.Turbo accelerator: users can add instructions and accelerators
  High-definition audio DSP incorporates innovative features to provide the right balance between silicon cost and processing

ASIPs for Multimedia (Audio)

Z.Turbo accelerator of ZSP processor

Features
  User-definable, user-configurable
  Enables users to add their own accelerator or co-processor
  Accelerates special functions without burdening the main DSP core
  More power and cost efficient than just cranking up MHz or just adding more execution units
  Customers can differentiate using their own designs on top of the ZSP architecture


ASIP for FEC

FEC ASIP - IMEC

Features
  The world's first decoding of Turbo code and LDPC in one processor
  Using a multiprocessor with several SIMD architectures shows high performance and energy efficiency
  Handling scrambling of LDPC and interleaving of turbo code with rAGU (reconfigurable Address Generation Unit)

[Figure: FEC ASIP block diagram. Input/output interface with input FIFO and output FIFO; three background memory banks, each with AGU1, AGU2, and a Shuffler; Rotation engine; three cores, each with an aligned scratchpad, an N-way SIMD pipeline, VRF, LIFO memory, SRF, Control Unit, and Program memory; Control interface]

ASIPs for MPSOC system

Aachen Univ. - T.G. Noll team

Reconfigurable ASIP architecture using eFPGA (embedded FPGA)
  More application specific architecture than a typical FPGA
  Small area and low power architecture: optimized arithmetic operations
  Performance can be updated using a programming language such as an HDL
  Using configurable blocks, the performance is close to that of an ASIC, with low cost and development time

[Figure: ASIP core with Instruction Memory, Control unit, registers, eFPGA, and Configuration memory]

ASIPs for MPSOC system

Aachen Univ. - H. Meyr team

Reconfigurable ASIP architecture using CGRA (Coarse Grained Reconfigurable Architecture)

CGRA
  Includes arithmetic or logical operations, or a specific processing element, inside the core
  Instead of an FPGA, a CGRA implements the system using an architecture inside the core
  The reconfigurable block is also an application specific block
  Although a CGRA is less flexible than an FPGA, development is fast and low cost with an application specific CGRA

[Figure: CGRA PE architecture. A configurable processing element with an adder (+), a shifter (>>), and registers]

ASIPs for MPSOC system

ASIP should be specialized for a specific application

To optimize the MPSOC system
  Support the interface for communication among ASIPs inside the system
  Guarantee compatibility among compilers

Need a low power architecture for mobile devices

ASIP design technologies

Architecture Description Language (ADL) based design
  Maximize flexibility and efficiency, but significant design effort
  LISATek (CoWare), IP Designer (Target), ASIP Meister (ASIP Solutions, Inc.)

Configurable Processor Cores
  Use pre-designed and pre-verified core
  Efficiency via custom instruction set extensions
  Xtensa (Tensilica), CorExtend (MIPS), Configurable cores ARC600, ARC700 (ARC)

ADL based ASIP design

LISATek Processor Designer - CoWare
  Language for Instruction-set Architectures (LISA) is a powerful, representative instruction-set description language
  Generates a complete set of SW development tools, including an optimizing C-Compiler and a fast instruction-set simulator

ADL based ASIP design

IP Designer - Target Compiler Technologies
  Retargetable tool suite for ASIP design
  Define the ASIP architecture in the nML language (a hierarchical and highly structured architecture description language)

ADL based ASIP design

ASIP Meister - ASIP Solutions, Inc.
  Generates dedicated processor hardware descriptions and software development tools automatically based on target specifications
  Operations of instructions can be defined easily using the Micro Operation description language provided by ASIP Meister

Configurable ASIPs

Xtensa LX3 - Tensilica

Architecture
  16 bit or 32 bit multiplier, single 16 bit MAC
  Supports multiprocessor systems
  Adopts multi-issue VLIW using the FLIX (Flexible Length Instruction eXtensions) architecture
  Selectable 5-stage or 7-stage pipeline
  Configurable over a wide range of pre-verified options

Configurable ASIPs

Xtensa LX3 - Tensilica

XPRES compiler - feature
  Analyzes C/C++ source code and a run-time application profile to automatically suggest configuration settings and new instructions
  Provides a useful starting point for further optimization by the designer

XPRES compiler - design flow
  Input: application code or functional specification in full C/C++ language
  XPRES Compiler: analyzes thousands of possible processor configurations in minutes and selects the optimal configuration
  TIE (Designer-Defined Instructions): optimally tune TIE, or combine with manually generated or automatically generated TIE
  Xtensa Processor Generator: takes the Processor Configuration and the TIE as input and builds the complete optimized hardware block and tool-chain in minutes
  Output: Hardware (RTL), System Models, Complete Software Tools
Configurable ASIPs

CorExtend - MIPS

Features
  Allows SoC designers to add proprietary instructions and tightly coupled hardware
  As many instructions as an expert designer needs can be added
  MIPS32@4KE, M4K, 4KSd Pro, MIPS32@24K Pro, 24KE, MIPS32@34K Pro, MIPS32@74K, MIPS32@1004K


Configurable ASIPs

Configurable cores ARC600, ARC700 - ARC

Features
  Enables designers to add features they need and remove features they do not need for their individual application
  Offers the flexibility to add instructions, registers, flags, and condition codes, creating a processor that is highly tuned for a specific application

Evolution of ASIPs

Future of ASIPs
  Higher Performance
  Reconfigurable
  More specific application
  Low power consumption
  High Flexibility



Conclusions

The three proposed ASIPs for OFDM systems, Audio, and Video combine the high performance of ASIC and the flexibility of DSP
  Smaller hardware size than existing DSPs
  Support various standards

ASIP Core for OFDM communication systems
  Special instructions and hardware architectures for FFT and bit manipulation
  Support various OFDM and DMT modem systems

ASIP Core for Audio
  Special instructions for audio coding
  Accelerator for Huffman decoding
  Support various high quality audio codecs

ASIP Core for Video applications
  Special instructions for video coding
  Two coprocessors for ME/MC and VLC
  Support various video codecs

Implemented ASIPs

[Figure: chip photos, < SPOCS > < VSIP ME > < VSIP MC > < DASIP >]

ASIP Specification

         Library        Operation Frequency   Gate Counts   Memory Size   Remarks
SPOCS    Sec 0.18㎛     280MHz                107,000       12Kbyte       -
VSIP     HSI 0.25㎛     160MHz                141,260       24Kbyte       ME/MC hardware accelerator
DASIP    Sec 0.18㎛     200MHz                120,283       24Kbyte       -

Implemented chips (1/2)

DSP for wireless communication : 40MHz
MDSP (1st version) : 30MHz
MDSP (2nd version) : 60MHz
MDSP (3rd version) : 60MHz

Multimedia DSP + Fixed Point DSP
  Mobile multimedia communication
  DCT (176 x 144) : 168.64 fr/s
  BMA (352 x 240) : 14 fr/s

16 bits fixed point DSP
  Instructions are compatible with Motorola DSP56100

Implemented chips (2/2)

PRML Read Channel Filter
DOCSIS 2.0 Cable modem
WLAN modem chip IEEE 802.11
LMDS DOCSIS
RS+Viterbi FEC decoder
Parallel image processor
FFT processor
S-DCME RS decoder

DVB-S2 System Chip Design

ETRI - Ajou University SoC Lab


DVB-S2 Receiver System Description

Standard of Satellite Digital Video Broadcasting
Characteristics
  Channel adaptive transmitter algorithm using ACM (Adaptive coding and modulation) and VCM (Variable coding and modulation)
Important 3 signal processing blocks
  Synchronizer : Timing and frequency synchronization and demodulation
  FEC : Error detection and correction
  Mode de-adaptation : Packet header decoding

[Block diagram: ADC, DVB-S2 synchronizer (Ajou univ.), FEC (LDPC+BCH), Mode de-adaptation, MODCOD, Video signal]

ADC : Analog Digital Converter
MODCOD : The code of modulation method and code-rate
BCH : Bose-Chaudhuri-Hocquenghem multiple error correction binary block code


DVB-S2 Synchronizer Description

STR : Using Gardner algorithm
Frame Sync : Adopts correlation scheme GDPDI
Freq Sync : Coarse, fine and phase estimation
SNR Estimator : Using SNV algorithm
Reed-Muller decoder : MODCOD decoding
Demapper : QPSK, 8PSK, 16APSK, 32APSK demodulation

[Block diagram: ADC, STR, AGC, Frame Sync., Descrambler, Freq. Sync., Demapper, SNR Estimator, Reed-Muller Decoder; signals: frame done, SNR, MODCOD]

STR : Symbol timing recovery
AGC : Automatic Gain Controller
GDPDI : differential generalized post-detection integration
SNV : Squared Signal-to-Noise Variance
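
The slide only states that STR uses the Gardner algorithm. As a rough illustration, here is a minimal Python sketch of the textbook Gardner timing error detector, assuming 2 samples per symbol and complex baseband input. The function names, the sample layout, and the sign convention are assumptions for illustration, not the implementation used in the chip.

```python
def gardner_error(prev_sym, mid_sample, curr_sym):
    """Gardner timing error for one symbol interval.

    prev_sym and curr_sym are consecutive symbol-strobe samples,
    mid_sample is the sample halfway between them.
    Sign convention (an assumption): Re{(curr - prev) * conj(mid)}.
    """
    diff = curr_sym - prev_sym
    # The error is zero when the midpoint sample sits on the zero crossing.
    return (diff * mid_sample.conjugate()).real


def gardner_errors(samples):
    """Timing errors for a stream sampled at 2 samples per symbol.

    samples[0], samples[2], ... are treated as symbol strobes and
    samples[1], samples[3], ... as the midpoints between them.
    """
    return [
        gardner_error(samples[k - 2], samples[k - 1], samples[k])
        for k in range(2, len(samples), 2)
    ]
```

With ideal timing on alternating symbols, for example gardner_errors([1, 0, -1, 0, 1]), every midpoint is zero and so is every error. A timing offset makes the midpoints non-zero and gives errors of a consistent sign, which a loop filter can use to steer the sampling instant.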

DVB-S2 Test Environments

Use VCM (Variable coding and modulation)
  QPSK : code rate 1/2
  8PSK : code rate 2/3
SNR : 6dB
Sample rate : 9Msymbol/s
Carrier frequency : 21GHz


DVB-S2 Test Movie


Papers and Patents list

Papers
[1] Low power complexity-reduced ME and interpolation algorithms for H.264/AVC, Jour. of signal proc., 2009
[2] SPOCS : Application specific signal processor for OFDM communication systems, Jour. of signal proc., 2008
[3] ASIP Approach for implementation of H.264/AVC, Jour. of signal proc., 2008
[4] Novel intra prediction algorithm using residual prediction for low power multimedia codecs, ISIC2009
[5] Efficient integer motion estimation algorithm using sub-sampling, IEEE ISOCC2009
[6] Novel residual prediction scheme for hybrid video coding, IEEE ICIP2009
[7] Novel frame selection methods for multi-reference motion estimation, International Conference on Digital Signal Processing 2009
[8] Efficient frame selection schemes for multi-reference and variable block Size Motion Estimation, IEEE ICME2008
[9] Novel fractional pixel motion estimation algorithm using motion prediction and fast search pattern, IEEE ICME2008


Papers
[10] Improved frame selector for Multi-Reference motion estimation, IEEE ISOCC2008
[11] Fast multiple reference frame selection method For H.264/AVC, IEEE WSPS2008
[12] Fast full search motion estimation algorithm using MNPDS, IEEE ICEIC2008
[13] Power efficient integrated motion compensator for MPEG and H.264/AVC, IEEE SLPHSC2008
[14] Three low power ASIP processor designs for communications, video, and audio applications, DTIS 2007
[15] An ASIP approach for H.264/AVC implementation having novel coprocessors, SIPS 2007
[16] Low power ASIC architecture optimization based on target application profiling, IEEE SCS2007
[17] Novel non-linear inverse quantization algorithm and its architecture for digital audio codecs, ISCAS 2007
[18] VSIP : Implementation of video specific instruction-set processor, IEEE APCCAS 2006



Papers
[19] VSIP: Video specific instruction set processor for H.264/AVC, IEEE SIPS 2006
[20] ASIP approach for implementation of H.264/AVC, IEEE ASP-DAC2006
[21] Efficient memory reuse and sub-pixel interpolation algorithms for ME/MC of H.264/AVC, IEEE SIPS 2006
[22] Efficient motion estimation accelerator for H.264/AVC, A-SSCC 2006
[23] ASIP instructions and their hardware architecture for H.264/AVC, ISOCC 2005
[24] Implementation of application-Specific DSP for OFDM Systems, IEEE international Symposium on Antennas and Propagation 2005
[25] Application-specific DSP architecture for H.264/AVC, ITC-CSCC 2005
[26] Reconfigurable coprocessor for communication systems, ITC-CSCC 2005
[27] Design of a high-quality audio-specific DSP core, Best Paper Award in IEEE SIPS 2005
[28] Novel instructions and their hardware architecture for video signal processing, IEEE ISCAS 2005
[29] Implementation of application-specific signal processor for high-speed OFDM Systems, COOL Chips Ⅷ


Papers
[30] Implementation of a wireless multimedia DSP chip for mobile applications, Jour. of VLSI signal proc., 2005
[31] Digital signal processor architecture with bit manipulation accelerator for communication systems, EURASIP JASP, 2005
[32] Implementation of a wireless multimedia DSP chip for mobile applications, Jour. of VLSI signal proc., 2005
[33] ASIP Instructions and their hardware architecture for H.264/AVC, Journal Semiconductor Technology and Science, 2005
[34] Audio-Specific Signal Processor (ASSP) for High-Quality Audio Codec, A-SSCC 2005
[35] Implementation of application-specific signal processor for high-speed communication systems, ISPACS 2004
[36] Design of reconfigurable coprocessor for communication Systems, SIPS 2004
[37] Implementation of application-specific DSP for OFDM systems, ISCAS2004
[38] Design of new DSP instructions and their hardware architecture for high-speed FFT, Jour. of VLSI signal proc., 2003


Patents
[1] Computing Circuits and Method for Running an MPEG-2 AAC or MPEG-4 AAC Audio Decoding Algorithm on Programmable Processors, US patents
[2] Frequency error estimator and frequency error estimating method thereof, US patents
[3] Modulation apparatus using mixed-radix fast Fourier transform, US patents
[4] Bit manipulation operation circuit and method in programmable processor, US patents
[5] Apparatus and method for computing an FFT in a programmable processor, European patents
[6] FFT operating apparatus of programmable processors and operation method thereof, US/European patents
[7] Modulation apparatus using mixed-radix fast Fourier transform, Japan patents
[8] Computing circuits and method for running an MPEG-2 AAC or MPEG-4 AAC audio decoding algorithm on programmable processors, Korea patents



Patents
[9] Reducing decoding complexity method and devices for low density parity check, Korea patents
[10] Frequency error estimator and frequency error estimating method thereof, Korea patents
[11] Frame synchronization circuit in DVB-S2, Korea patents
[12] Reference frame selection method for multi-reference motion estimation of high performance multimedia codec, Korea patents
[13] S-DCME algorithm processing Methods and Circuits for Reed-Solomon decoder, Korea patents


Thank you!

